8.9 Construction of an ESC with N = 8 and 4x4 switching elements.

9.1 Two 8-adjacent 1s and two non-4-adjacent 0s.

9.2 (a) Data allocation for an R x R image using N PEs. (b) Data transfers needed to apply Sobel edge operator.

9.3 Sobel operator algorithm defined for a subimage.

9.4 Parallel algorithm for EGT at PEj.

9.5 (a) Naming convention for the neighbors of the center pixel in a 3 x 3 window. (b) Example showing start of tracing.

9.6 Example of Phase I contour tracing for a 10 x 20 image. The triple (i,x,y) represents the i-x-y coordinates of the pixel.

9.7 Contours found by Phase I row scans for a 30 x 20 sample image.

9.8 Contours found by the Phase I column scan for the image of Figure 9.7.

9.9 Example to illustrate Phase II activity, protocol, and mechanisms. End point coordinates are given where (i,x,y) represents the i-x-y coordinates of the pixel.

9.10 Section of a poorly illuminated circuit board overlaid with a 16 x 16 pixel grid.

9.11 A binary image obtained by thresholding Figure 9.10 with a single threshold of 153.

9.12 The EGT merit value graphs for the 64 subimages in the upper left corner of Figure 9.10. The horizontal axis for each graph is gray value (0 to 255) and the vertical axis is the threshold merit value (0 to 32).

9.13 Binary image resulting from EGT-based segmentation of Figure 9.10.

9.14 Contours extracted from Figure 9.10.
The demand for very high speed data processing coupled with falling
hardware  costs  has  made  large-scale  parallel  and  distributed  computer  systems 
both  desirable  and  feasible.  Two  modes  of  parallel  processing  are  single 
instruction  stream  -  multiple  data  stream  (SIMD)  and  multiple  instruction 
stream  -  multiple  data  stream  (MIMD).  PASM,  a  partitionable  SIMD/MIMD 
system,  is  a  reconfigurable  multimicroprocessor  system  being  designed  for 
image  processing  and  pattern  recognition.  An  important  component  of  these 
systems  is  the  interconnection  network,  the  mechanism  for  communication 
among  the  computation  nodes  and  memories.  Assuring  high  reliability  for  such 
complex systems is a significant task. Thus, a crucial practical aspect of an
interconnection network is fault tolerance.

In answer to this need, the Extra Stage Cube (ESC), a fault-tolerant,
multistage cube-type interconnection network, is defined. The fault tolerance
of the ESC is explored for both single and multiple faults, routing tags are
defined, and consideration is given to permuting data and partitioning the ESC
in the presence of faults. The ESC is compared with other fault-tolerant
multistage networks. Finally, the reliability of the ESC and of an enhanced
version of it is investigated.


A  knowledge  of  the  performance  of  various  switching  element  designs  is 
important  to  the  engineering  of  interconnection  networks.  Typically,  networks 
proposed for parallel systems have been designed with two-input/two-output
switches.  VLSI  technology  allows  implementation  of  complex  circuits  as  a 
single  device.  The  performance  of  four-input/four-output  switches  under 
various  message  loading  conditions  is  analyzed  and  their  use  in  the  ESC 
considered. 

Finally,  a  parallel  digital  image  processing  scenario  for  implementation  on 
a  computer  system  such  as  PASM  is  analyzed.  Contour  extraction  is  chosen  as 
the  focus  because  it  is  a  key  step  in  many  applications  and  presents  a 
multifaceted  challenge  to  a  parallel  computer.  Issues  studied  include  parallel 
formulation  of  the  constituent  algorithms,  mapping  the  algorithms  and  sizing 
the  machine,  quality  of  results,  and  implications  for  network  design  and  system 
architecture. 
CHAPTER 1
INTRODUCTION 

1.1  Motivation 

Consider  the  experience  of  the  Boeing  Commercial  Airplane  Company  in  a 
project  to  improve  the  propulsive  efficiency  of  the  Boeing  737  aircraft  [RuT83, 
TiC84].  The  737  was  originally  designed  with  low-bypass  ratio  turbojet  engines 
mounted  below  the  wings.  Advances  in  technology  have  led  to  high-bypass 
ratio turbofan engines, which are more fuel efficient than the turbojet engines
they replace yet are inherently larger in diameter.

Conventional  aerospace  engineering  practice  prescribes  a  certain  separation 
between  an  engine  nacelle,  or  housing,  and  wing  so  as  to  avoid  excessive  drag 
due  to  airflow  interference  between  the  two  structures.  Figure  1.1(a)  shows  a 
generalized  wing  and  nacelle  with  the  relevant  dimensions  indicated.  Figure 
1.1(b)  shows  the  state  of  the  art  concerning  nacelle  installation,  achieved 
through  years  of  wind  tunnel  testing.  Aircraft  designers  found  that  a  nacelle 
positioned  so  as  to  be  above  the  dotted  line  in  Figure  1.1(b)  gave  rise  to 
excessive  drag  (three  to  five  percent  of  the  total  aircraft  drag).  The  precise 
nature  of  the  drag  was  unknown.  Thus,  the  dotted  line  represents  both  a 
family  of  closest  installations  for  nacelle  and  wing,  based  upon  conventional 
design  as  supported  by  wind  tunnels,  and  a  baseline  from  which  to  measure 
new  design  techniques. 


Figure 1.1 (a) Generalized wing and engine nacelle showing relevant
dimensions to determine engine nacelle installation. (b) Plot of
nacelle installations for a number of transport jet aircraft.


Using conventional design for the 737 engine refit project would result in the
high-bypass engines contacting the runway (see Figure 1.2) due to their large
diameter and the prescribed separation between wing and nacelle. To
eliminate this problem the landing gear would have to be lengthened, requiring
redesign and sharply reducing the attractiveness of the refit to owners
of existing 737 aircraft because of the additional expense of installing new
landing gear. Further, the longer landing gear would weigh more, offsetting
some of the fuel efficiency gain of the new engines.

The  relatively  recent  availability  of  sufficiently  powerful  computers  made 
study  of  engine/wing  interference  drag,  via  aerodynamic  simulation,  feasible. 
This research revealed the source of the drag [daC78] and the design solution:
proper  choice  of  the  shape  of  the  nacelle  and  nacelle  support  strut.  Computer 
simulation  guided  the  designers  in  the  737  project  to  nacelle  and  strut  shapes 
that  afforded  acceptably  low  drag  and  achieved  adequate  ground  clearance  with 
the  existing  landing  gear  while  being  compatible  with  housing  the  engine  and 
its  accessories  (see  Figure  1.2).  An  additional  benefit  of  the  close-coupled 
installation is the reduced size, and hence weight, of the nacelle support strut.

Flight testing this year of the refitted 737 (known as the 737-300) has
shown the interference drag due to the nacelle installation to be much less than
one percent of total aircraft drag [Tin84]. Figure 1.3 shows the relationship of
the computationally-derived nacelle installation for the 737-300 (as well as for
similar projects for the 707, 757, and 767 aircraft) to that possible with
conventional wind tunnel methodology. It also shows the range of nacelle and
strut shapes used. These computationally-derived nacelle shapes and
installations lie above the dotted line shown in the plot of Figure 1.3, indicating
a close installation, yet they do not incur high drag.


NACELLE POSITIONED ON THE WIND TUNNEL BASELINE BOUNDARY

Figure 1.2 Position of nacelle using the best (baseline) wind tunnel
technology and a computationally-derived installation close-coupled
with the wing [RuT83].
Computationally Derived Close-Coupled Nacelle Positions

Figure 1.3 Plot of several computationally-derived nacelle installations in
comparison to wind tunnel methodology along with a sketch of
each [RuT83].

The alternative to computer simulation, empirical efforts to develop low
drag nacelle installations via wind tunnel experimentation, has had limited
success [RuT83] for several reasons. One reason is that wind tunnels readily
provide  information  about  total  drag,  but  not  the  drag  due  to  individual 
airframe  components.  Another  is  that  it  is  not  always  possible  to  measure 
wind  tunnel  air  flow  phenomena  in  fine  detail.  The  development  of  the  737-300 
would  not  have  been  accomplished  without  high-speed  computers. 

High-speed  computation  is  vital  to  the  future  of  many  human  endeavors  in 
addition  to  aircraft  design.  Continued  progress  in  such  disciplines  and 
activities  as  theoretical  physics  and  chemistry,  flight  simulation,  fusion  energy 
research,  image  generation  and  processing,  integrated  circuit  design  and 
simulation, hydrocarbon exploration and reservoir modeling, continuous speech
recognition,  structural  analysis,  and  weather  forecasting  depends  upon  the 
continuing  availability  of  yet  faster  computers.  There  are  important  problems 
in  each  of  these  areas  that  can  neither  be  feasibly  solved  using  available 
computers  nor  be  solved  without  computers. 

Modern  high-performance  computers  deliver  up  to  roughly  several  hundred 
million  floating  point  operations  per  second  (MFLOPS)  when  executing  actual 
applications  programs;  speeds  typically  average  more  nearly  10  to  50  MFLOPS. 
Careful  programming  can  increase  these  achieved  rates,  but  only  by  a  factor  of 
two  to  four.  Over  the  next  ten  years,  a  computing  performance  increase  by  a 
factor  near  one  thousand  will  be  needed  to  sustain  normal  progress  in  many 
disciplines [AdD84]. This computational need cannot be dismissed as the result
of  inadequate  algorithm  design.  Continued  progress  with  algorithms  will 
reduce  the  execution  time  of  many  problems,  but  faster  hardware  will  still  be 
essential.  The  demand  for  increased  computational  power  is  growing  rapidly 
and will continue to grow for the long term future.


The uniprocessor design of computers is today approaching fundamental
physical limits to ultimate performance. While circuit switching speeds and
circuit packing density will continue to improve, the impressive historic rate of
increase is likely to slow. It is unlikely that uniprocessor technology will
provide sustained speeds beyond 1000 MFLOPS in this century. Yet, this is
the realm of computational power that is needed. An opportunity to achieve
greater  processing  speeds  lies  with  computer  systems  employing  multiple 
processing  elements  acting  together.  The  truly  large  gains  will  be  made  if 
hundreds  and  even  thousands  of  processors  can  be  made  to  work  effectively  in 
concert. 

The demand for very high speed computing coupled with falling hardware
costs has made large-scale parallel and distributed computer systems both
desirable and feasible. An important component of these systems is the
interconnection network, the mechanism for information transfer among the
computation nodes and memories. Assuring high reliability for such complex
systems is a significant task. Thus, a crucial practical aspect of an
interconnection network is fault tolerance. Study of a fault-tolerant multistage
interconnection network is one of three topics in this work.

A knowledge of the performance of various network switching element
designs is important to the engineering of interconnection networks. VLSI
technology opens up wider possibilities for network implementation by allowing
more complex, sophisticated switching elements. An analysis of alternative
switching element designs suited for VLSI implementation is a second area
addressed in this work. The switching elements investigated could
replace the simpler ones used in numerous interconnection networks, including
the one studied here, lowering cost and/or improving network performance.

Finally,  the  interaction  of  application  programs  and  parallel/distributed 
systems  is  a  fundamental  question.  Detailed  investigation  of  a  particular 
application  from  a  problem  domain  can  guide  many  aspects  of  system  design  as 
well  as  illuminate  the  issue  of  mapping  the  application  to  the  machine.  The 
third  aspect  of  this  work  is  an  investigation  of  digital  image  contour  extraction 
using  a  parallel  computer. 

1.2  Overview 

The  approach  taken  in  this  work  attempts  to  be  cognizant  of  engineering 
considerations  for  building  a  parallel  computer  system.  For  example,  not  only 
must  an  interconnection  network  have  desirable  theoretical  properties,  these 
properties  should  be  feasible  to  attain  and/or  use.  The  intent  is  to  attain 
useful  knowledge  for  design  of  parallel  and  distributed  computer  systems. 

Chapter  2  surveys  five  existing  or  proposed  parallel  processing  systems 
that  use  multistage  interconnection  networks  in  their  designs.  These  systems 
could  potentially  incorporate  the  fault-tolerant  interconnection  network 
developed  and  studied  in  later  chapters.  The  information  presented  in  this 
chapter  is  intended  to  provide  a  context  in  which  such  networks  can  be  viewed. 

The fault-tolerant interconnection network that is the focus of much of
this  work  is  the  Extra  Stage  Cube  (ESC),  which  derives  from  the  Generalized 
Cube  network.  Chapter  3  is  the  first  of  four  chapters  to  deal  with  the  ESC 
network.  In  it  the  Generalized  Cube  network  is  defined  and  its  basic  properties 
stated,  then  the  ESC  is  defined. 

Properties of the ESC are set forth in Chapter 4. Its single fault tolerance
is established and its capacity for coping with multiple faults studied. Routing
tags for network control under both fault-free and faulted conditions are
defined.  Partitioning  the  network  and  permuting  data  using  it  are  also 
discussed. 

Chapter  5  reviews  the  state  of  the  art  in  fault-tolerant  multistage 
interconnection  networks.  The  characteristics  of  these  networks  are  described 
and  the  nature  of  their  fault  tolerance  discussed.  Each  network  is  compared 
with  the  ESC. 

With  the  fault-tolerant  capabilities  of  the  ESC  determined,  Chapter  6 
presents  an  analysis  of  ESC  reliability.  Reliability  is  measured  as  the 
probability  that  there  exists  at  least  one  path  between  any  network  input  and 
output. An exact solution for the case of two faults is developed based on the
ESC  fault  model  stated  in  Chapter  4. 

Consideration  of  the  results  in  Chapter  6  shows  possible  areas  for 
improvement  in  the  ESC  design  and  operational  protocol.  In  Chapter  7,  an 
enhancement  of  the  basic  ESC  topology  is  described.  This  provides  increased 
fault  tolerance  with  only  a  small  increase  in  system  hardware  complexity;  no 
logic need be added to the ESC. The effect of the enhancement on ESC
properties and reliability is investigated. Large networks are shown to benefit
more  than  small  ones  from  this  modification. 

Chapter  8  presents  a  performance  analysis  and  comparison  of  two 
switching  elements  suitable  for  use  in  the  ESC  and  many  other  interconnection 
networks.  These  switching  elements  are  alternatives  to  the  traditional 
interchange  box  and  are  suited  to  very  large-scale  integration  (VLSI) 
manufacture. 

PASM is a parallel computer system intended for image
processing and pattern recognition tasks. To ensure that the design
architecture can meet this goal, it is useful, probably even necessary, to
consider selected image processing tasks to learn what is required of system
hardware for their execution in a manner satisfactory to users. In Chapter 9 a
scenario  for  image  contour  extraction  based  on  parallel  computation  is 
developed  and  used  to  explore  the  advantages  of  parallel  computation  and 
issues  in  parallel  computer  design.  Parallel  forms  of  edge-guided  thresholding 
and  contour  tracing  algorithms  are  constructed  and  analyzed  to  highlight 
important  aspects  of  the  scenario.  The  implications  that  the  scenario  has  for 
parallel  computer  architecture  are  considered,  including  interconnection 
network  design.  Various  important  system  attributes  are  identified  and 
described. 

The  remaining  sections  of  this  chapter  present  the  basic  vocabulary  and 
definitions  needed  in  subsequent  chapters.  Descriptions  of  the  major  parallel 
computer architecture classes and some of their important characteristics are
included. Fundamental interconnection network terminology is defined.


1.3  Parallel  Computer  System  Terminology 

An SIMD (Single Instruction Stream - Multiple Data Stream) [Fly66]
machine typically consists of a control unit, N processors, N memory modules,
and an interconnection network (e.g., Illiac IV [BDM72]). The control unit
broadcasts instructions to all of the processors, and all active processors
execute the same instruction at the same time. Thus, there is a single
instruction stream. Each active processor executes the instruction on data in
its own associated memory module. Hence, there are multiple data streams.
The interconnection network, sometimes referred to as an alignment or
permutation network, provides a communications facility for the processors and
memory modules [Sie70b, Sie85]. Illiac IV [BDM72] and the Massively Parallel
Processor (MPP) [Bat80] are examples of SIMD systems.
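
The broadcast-and-mask behavior described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited machines: the function name simd_step, the use of one integer per memory module, and the boolean active list are all assumptions made for illustration.

```python
from typing import Callable, List

def simd_step(op: Callable[[int], int],
              memories: List[int],
              active: List[bool]) -> List[int]:
    """Apply one broadcast instruction `op` in every active processor.

    Each processor works only on the datum in its own memory module;
    inactive processors leave their data unchanged. (Hypothetical model:
    one integer stands in for a whole memory module.)
    """
    return [op(m) if a else m for m, a in zip(memories, active)]

# One instruction stream (multiply by 10), four data streams.
memories = [1, 2, 3, 4]
active = [True, False, True, True]
result = simd_step(lambda x: x * 10, memories, active)
print(result)  # [10, 2, 30, 40]
```

The single op argument plays the role of the single instruction stream, while the list of memories plays the role of the multiple data streams.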

Many architectural variations are possible within the class of SIMD
machines. One subclass of SIMD machines comprises those with the processing
element-to-processing element (PE-to-PE) organization. Figure 1.4 shows this
type of architecture. In this scheme, each processor is paired with a memory
module to form a processing element (PE). The interconnection network need
only support unidirectional information transfer, as each PE has access to both
a network input and output to transmit and receive information, respectively.
The network may be able to connect a given PE to all or a subset of the other
PEs. If two PEs cannot be directly connected through the network, then
indirect transfer of information through one or more PEs is necessary. Each
successive transfer is accomplished by an additional pass through the network.
The Illiac IV computer used the PE-to-PE organization and had an
interconnection network that allowed direct connection between a given PE
and only its four nearest neighbors, where PEs were arranged in a square mesh
fashion [BBK68, BDM72].
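
As a rough illustration of indirect transfer, the following Python sketch counts how many passes through a four-nearest-neighbor network a datum needs to travel between two PEs. The wraparound rule used here (PE i connected to PEs i+1, i-1, i+n, and i-n modulo N, with n the square root of N) is an assumption, since the text above states only that each PE reaches its four nearest neighbors in a square mesh. The function names are hypothetical.

```python
from collections import deque
from math import isqrt

def mesh_neighbors(pe, N):
    """Four directly connected PEs under the assumed wraparound rule."""
    n = isqrt(N)
    if n * n != N:
        raise ValueError("N must be a perfect square")
    return [(pe + 1) % N, (pe - 1) % N, (pe + n) % N, (pe - n) % N]

def passes_needed(src, dst, N):
    """Minimum number of network passes to move data from src to dst.

    Breadth-first search: each additional pass reaches the neighbors
    of every PE reached so far.
    """
    dist = {src: 0}
    queue = deque([src])
    while queue:
        pe = queue.popleft()
        if pe == dst:
            return dist[pe]
        for nxt in mesh_neighbors(pe, N):
            if nxt not in dist:
                dist[nxt] = dist[pe] + 1
                queue.append(nxt)
    raise RuntimeError("destination unreachable")

print(passes_needed(0, 1, 16))   # 1: PEs 0 and 1 are directly connected
print(passes_needed(0, 5, 16))   # 2: one intermediate PE (e.g. 0 -> 1 -> 5)
```

A result greater than one means the data must be relayed through intermediate PEs, each relay costing one more pass.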

Another SIMD machine subclass is the processor-to-memory (P-to-M)
structure, which utilizes an interconnection network placed between the
processors and memories. Figure 1.5 shows this architecture. Note that the
number of processors and memories need not be the same. The interconnection
network must support bidirectional information transfer in these machines. A
memory module is said to be common to two processors if both can access it
directly through the network. Two processors can thus communicate directly
through a memory module they have in common. If no such memory module
exists, information can be passed by intermediate processors through
intermediary memory modules. The P-to-M structure is used in the Texas
Reconfigurable Array Computer (TRAC) [KPL80, PKM80, SUK80]; the
number of memories is larger than the number of processors for this machine.

An MSIMD (multiple-SIMD) machine is a parallel processing system which
can be dynamically reconfigured to operate as one or more independent SIMD
machines of potentially various numbers of processors and memory modules.
Each independent SIMD machine is a partition of the MSIMD system and is
referred to as a virtual SIMD machine. An MSIMD system typically consists of
N processors, N memory modules, an interconnection network, and Q control
units, where 1 < Q < N. Proposed systems capable of MSIMD mode operation
include MAP [Nut77] and PASM [SSK81]. Illiac IV was originally intended to
be an MSIMD system [BBK68]. MSIMD systems can be of either the PE-to-PE
or P-to-M type.

Possible advantages of an MSIMD system relative to an SIMD system with
a similar number of PEs include the following [SSK81].

1. Fault detection - For situations requiring high reliability, three or more
partitions can process the same data identically and compare results.

2. Fault tolerance - If a single PE fails, only those virtual SIMD machines
(partitions)  that  include  the  failed  PE  are  affected.  The  rest  of  the  system 
can  continue  to  function  as  before. 


3.  Multiple  simultaneous  users  -  Since  there  can  be  multiple  independent 
virtual  SIMD  machines,  there  can  be  multiple  simultaneous  users  without 
having  to  support  multitasking. 

4.  Program  development  -  Rather  than  debugging  an  SIMD  program  on  the 
entire  system,  a  user  can  select  a  small  partition  for  program  testing. 

5.  Efficiency  -  If  a  task  requires  only  a  subset  of  the  available  PEs,  the  other 
PEs  can  be  used  for  a  different  task. 

6. Subtask parallelism - Two or more independent SIMD subtasks that are
part  of  the  same  job  can  be  executed  in  parallel,  sharing  results  if  necessary. 
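
The fault detection advantage (item 1) can be sketched in Python as follows. The list above says only that partitions compare results; resolving a disagreement by majority vote is an assumption added here, and the function name compare_partitions is hypothetical.

```python
from collections import Counter

def compare_partitions(results):
    """Compare the results produced by several partitions on the same data.

    Returns (agreed_value, suspects). `suspects` lists the indices of the
    partitions whose result differs from the strict-majority value. If no
    strict majority exists, no value is trusted and every partition is
    treated as suspect. (Majority voting is an illustrative assumption.)
    """
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        return None, list(range(len(results)))
    suspects = [i for i, r in enumerate(results) if r != value]
    return value, suspects

# Three partitions ran the same computation; the third disagrees.
print(compare_partitions([42, 42, 41]))  # (42, [2])
```

With only two partitions a mismatch can be detected but not attributed, which is one reason a scheme like this would use three or more.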

An MIMD (Multiple Instruction Stream - Multiple Data Stream) machine
[Fly66] typically consists of N processors and M memories, where each
processor can execute an independent instruction stream. As with SIMD
architectures, there are multiple data streams and an interconnection network.
Thus, there are N independent processors which can communicate among
themselves. These systems can be organized as either PE-to-PE or P-to-M.
There may be a coordinator unit to help orchestrate processor activity. Cm*
[SFS77] and C.mmp [WuB72] are two MIMD systems that have been
constructed.

A partitionable SIMD/MIMD machine is a parallel computer that can be
reconfigured as one or more independent virtual SIMD and/or MIMD machines
of various sizes [SMM70]. Such a machine has the same structure as an
MSIMD system, but the processors are capable of fetching, decoding, and
executing their own instruction streams in addition to acting on an instruction
stream from a control unit. Thus, each partition that can be formed by the
system can function as either a virtual SIMD or MIMD machine, and the mode
of operation within a partition can change over time. PASM [SSK81] and
TRAC [KPL80, PKM80, SUK80] are examples of partitionable SIMD/MIMD
systems.

The  processors,  memories,  and  interconnection  network  of  a  partitionable 
SIMD/MIMD  computer  can  be  organized  in  various  ways,  as  discussed  for 
SIMD  machines.  PASM  uses  the  PE-to-PE  organization,  while  TRAC  uses  P- 
to-M.  The  advantages  of  an  MSIMD  system  relative  to  SIMD  machines  are 
available  with  partitionable  SIMD/MIMD  systems.  It  is  possible  to  switch 
between  SIMD  and  MIMD  modes  to  best  accommodate  successive  algorithms  or 
successive  phases  of  one  algorithm. 

1.4  Interconnection  Network  Terminology 

An interconnection network is a device designed to provide high-speed
communication among a set of processors that typically are physically close
and more or less closely coupled in operation. It is a key element in computers
of each of the parallel computer classes discussed in Section 1.3. For
convenience, the term “network” will often be substituted for the more
cumbersome “interconnection network” in the following. Networks are
composed of a collection of switches, or switching elements, and links, or wires.
Switching elements are also referred to as nodes. A network may consist of a
single stage, or bank, of switches, or multiple stages connected by links. A
single stage network utilizes only one stage of switches, and data sent by a
device using the network may have to pass through the network repeatedly to
reach its intended destination. A multistage network is constructed from two
or more stages of switches, and, typically, data can be sent to the desired
destination via one pass through the network.

The  hardware  complexity  of  an  interconnection  network  is  a  measure  of 
the  number  of  components  required  for  its  construction.  This  parameter  is 
frequently  used  to  make  general  comparisons  between  networks  because  it  gives 
an  approximate  indication  of  relative  implementation  cost.  An  asymptotic 
complexity  measure  is  typically  used  for  this  purpose  since  it  can  clearly 
express  basic  trends.  Let  f  be  a  function  representing  the  number  of  network 
components as a function of network size. Then, f(x) is of order g(x) (written
O(g(x))) if there exist constants c and x0 such that

f(x) < c g(x)

for x > x0 [Knu76, AHU74].
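
As a brief worked illustration of this definition, consider a hypothetical component count f(x) = 3x^2 + 5x; this function is chosen only for the example and does not describe any network in this work. Choosing c = 4 and x0 = 5 shows that f(x) is of order x^2:

```latex
\begin{align*}
  f(x) &= 3x^2 + 5x, \qquad g(x) = x^2, \\
  x > 5 \;&\Longrightarrow\; 5x < x^2
        \;\Longrightarrow\; f(x) = 3x^2 + 5x < 3x^2 + x^2 = 4\,g(x), \\
  \text{so } f(x) &= O(x^2) \text{ with } c = 4 \text{ and } x_0 = 5 .
\end{align*}
```

The lower-order term 5x affects the constants c and x0 but not the order, which is why the asymptotic measure expresses the basic trend.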

A  network  that  supports  information  flow  in  only  one  direction  through 
the  stage(s)  is  unidirectional.  Each  device  using  such  a  network  must  have 
access  to  both  a  network  input  and  output  in  order  for  communication  between 
two  devices  to  be  possible.  Network  inputs  and  outputs  are  generically  referred 
to  as  ports.  For  a  unidirectional  multistage  network,  the  stage  to  which  input 
ports  are  connected  is  the  input  stage,  and  the  stage  connected  to  the  output 
ports  is  the  output  stage.  A  bidirectional  network  allows  data  to  pass  through 
the stage(s) in either direction. There is no distinction of network ports
vis-a-vis the classes input and output for bidirectional networks; a device need have
access to only one port to communicate in this case.

The  topology  of  a  network  is  the  pattern  of  connections  in  the  structure  of 
the network. It is determined by the nature of the switching elements, the
connections between network ports and switching elements, and the
connections between stages of switching elements (for multistage networks).
Network topology is often used to compare different networks [McS82b, Thu74,
WuF80]. This is because such comparisons are independent of the particular
implementation of a network. The network analyses performed in the
following chapters are based on network topology. Thus, the results obtained
are not specific to any particular hardware technology.

Patterns  of  data  flow  through  a  network  can  be  classified  into  three  basic 
categories. With one-to-one connections information is passed from one
network port, the source, to another network port, the destination. The exact
route  taken  by  the  information  is  its  path.  Some  networks  provide  multiple 
paths  between  a  source  and  destination.  Often,  information  flow  from  one 
source  to  two  or  more  destinations  is  supported  by  a  network.  Such  a  transfer 
is termed a broadcast connection, and the route taken by the information is a
broadcast  path.  Finally,  consider  a  set  of  non-intersecting  one-to-one 
connections,  that  is,  a  set  such  that  no  two  one-to-one  connections  have  the 
same  source  or  destination.  If  these  connections  can  be  created  simultaneously 
within  a  network,  a  permutation  connection  results. 
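
The non-intersecting condition in this definition can be checked mechanically. The Python sketch below assumes a connection is written as a (source, destination) pair of port numbers; that representation and the function name are hypothetical. It tests only whether the set is non-intersecting, not whether a particular network can create all of the connections simultaneously.

```python
def is_non_intersecting(connections, num_ports):
    """True if no two one-to-one connections share a source or a destination.

    `connections` is a list of (source, destination) port-number pairs.
    This checks only the precondition for a permutation connection.
    """
    sources = [s for s, _ in connections]
    dests = [d for _, d in connections]
    in_range = all(0 <= p < num_ports for p in sources + dests)
    return (in_range
            and len(set(sources)) == len(sources)
            and len(set(dests)) == len(dests))

print(is_non_intersecting([(0, 1), (1, 0), (2, 2)], 4))  # True
print(is_non_intersecting([(0, 1), (2, 1)], 4))          # False: port 1 is a destination twice
```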

Many networks will not support all permutation connections. A
permutation is not supported if there is a conflict between two of the
one-to-one connections, i.e., both connections require a single output of some
switching element. Two such paths are said to contend for the switching
element output, and if one path is given control of the switch, the other
experiences blocking. A network that does not pass some permutations due to
conflict between paths is a blocking network. Switching elements can be viewed
as small interconnection networks and, hence, may also be characterized as
blocking if they do not support all permutation connections from their inputs
to outputs.
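
Contention can be illustrated with a short Python sketch. It assumes each path is given as a list of (stage, switching element, output) triples naming the switching element outputs the path uses; this representation and the function name are hypothetical and are not tied to any particular network topology.

```python
def find_conflicts(paths):
    """Return the conflicts among a set of paths.

    Each path is a list of (stage, element, output) triples. Two paths
    conflict when both require the same output of the same switching
    element. Each conflict is reported as (first_path, second_path, hop).
    """
    claimed = {}      # hop -> index of the first path that claimed it
    conflicts = []
    for idx, path in enumerate(paths):
        for hop in path:
            if hop in claimed and claimed[hop] != idx:
                conflicts.append((claimed[hop], idx, hop))
            else:
                claimed.setdefault(hop, idx)
    return conflicts

# Both paths need output 0 of element 1 in stage 1, so they contend.
paths = [[(0, 0, 1), (1, 1, 0)],
         [(0, 1, 0), (1, 1, 0)]]
print(find_conflicts(paths))  # [(0, 1, (1, 1, 0))]
```

If the first path is given control of the contested output, the second is the one that experiences blocking.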

In  addition  to  various  types  of  data  flow  patterns  that  may  be  allowed  by 
a  network,  there  are  two  basic  variations  of  information  transfer  protocol.  One 
is circuit switching. In this protocol a complete path from the source to the
destination  is  established  before  information  transmission  begins.  If  the 
network  provides  multiple  paths  they  may  be  searched  to  find  one  that  is  not 
blocked.  The  complete  path  is  maintained  until  the  data  transfer  ends,  then  it 
is  relinquished.  Complete  broadcast  paths  are  used  to  support  circuit-switched 
broadcasting. 
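
A minimal Python sketch of the circuit switching protocol follows. It assumes that each candidate path is a list of link labels and that a shared set records the links currently held by established circuits; the names establish_circuit, release_circuit, and busy_links are hypothetical.

```python
def establish_circuit(candidate_paths, busy_links):
    """Search the candidate paths for one with no busy link and hold it.

    The complete path is claimed before any data is sent. Returns the
    chosen path, or None if every candidate is blocked.
    """
    for path in candidate_paths:
        if not any(link in busy_links for link in path):
            busy_links.update(path)   # hold every link of the path
            return path
    return None

def release_circuit(path, busy_links):
    """Relinquish the complete path once the data transfer has ended."""
    busy_links.difference_update(path)

busy = {"a"}                                   # link "a" is already in use
path = establish_circuit([["a", "b"], ["c", "d"]], busy)
print(path)                                    # ['c', 'd']
release_circuit(path, busy)
```

Holding every link for the whole transfer is what distinguishes this protocol from packet switching, where links are requested one stage at a time.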

The other protocol is packet switching. In this case, a complete message is
subdivided  into  a  sequence  of  message  fragments  called  packets.  Each  packet 
carries  information  that  directs  it  through  the  network.  The  packets  are 
presented  one  at  a  time  to  the  network  by  the  source.  In  a  multistage  network, 
packets  move  stage  by  stage  toward  their  destinations.  At  each  stage  the 
appropriate  link  to  the  next  stage  is  determined  and  its  use  requested.  If  the 
switching  element  to  receive  the  data  in  the  next  stage  can  accommodate  the 
packet,  permission  to  send  is  granted  to  the  switching  element  with  the  packet. 
If  not,  the  packet  either  waits  for  passage  to  be  granted  or  is  sent  on  an 
alternate  link  (if  one  exists  and  is  available).  Both  unidirectional  and 
bidirectional  networks  can  be  designed  to  operate  under  either  circuit  or  packet 
switching  protocols,  or  both. 

There  are  three  basic  types  of  network  control  for  multistage  networks 
[SiS78].  One  is  individual  stage  control.  With  this  method  a  single  command 
signal  (which  may  be  several  bits)  sets  the  state  of  all  switching  elements  in  a 
stage.  Partial  stage  control  is  characterized  by  two  or  more  groupings  of 
switching elements per stage, each group having separate control. With
individual  box  (or  switching  element)  control  the  state  of  each  network  switch 
can  be  set  independently. 

Network  control  can  be  implemented  in  either  a  centralized  or  distributed 
fashion.  More  typically  considered  for  centralized  control  are  networks  with  a 
small  number  of  input/output  ports  and/or  those  using  individual  stage  or 
partial stage control. The advantage of centralized control is that switching
element hardware need only include circuitry to handle information flow, not
establish or maintain the avenues of flow. The disadvantage is that network
performance is limited by the central controller. Central controller speed
requirements become more extreme as network size increases (larger N) and for
networks in which selection of a route through the network is computationally
involved. 

Distributed control is particularly well suited to larger networks and those
with individual box control. Routing tags are a way of encoding information
describing a path through a network. They can be generated by the devices
using the network, thus distributing control of the network. There are a
number of advantages to distributed control. One is that the speed of network
path creation is not limited by any potential central controller processing
bottleneck, as it might be with centralized control. Another is the ability (for
certain networks, e.g., the Augmented Data Manipulator network [McS82b]) to
reroute information while it is in transit. Switches must be able to change
state based on the contents of the routing tags, and there must be provision in
the switch for arbitration if there are multiple needs for a single switching
element output. These greater hardware requirements may result in high
logic-to-pin ratios, allowing switch implementation to take advantage of VLSI.
A disadvantage of distributed network control is the potentially higher cost of
switching elements.

Dividing a network into independent subnetworks, each of which has all
the properties of the original network only for a smaller number of ports, is
called partitioning [Sie80]. Partitioning is useful for networks serving in
MSIMD or partitionable SIMD/MIMD systems. A subnetwork can be assigned
to each virtual SIMD or MIMD machine. Since subnetworks are disjoint,
machine independence is guaranteed.

1.6 Summary

In this chapter the aircraft design example illustrated one specific instance
of the need for very high-performance computing and gave motivation to the
study of parallel processing computers. The nature and organization of the
work presented in this document was outlined. Finally, basic terminology and
definitions to be used throughout this work were given.


CHAPTER  2 

SURVEY  OF  SELECTED  PARALLEL  COMPUTER  SYSTEMS 


2.1  Introduction 

The  field  of  parallel  computation  research  is  an  active  one.  Much  of  the 
activity  revolves  around  new  machine  designs,  and  well  over  one  hundred  have 
been described in the recent literature. It is feasible, therefore, to present only
a  sample  of  the  work  that  has  been  done. 

Although  the  computers  described  in  the  following  range  from  systems  for 
which there is only a design, through those that have been prototyped, to a
machine that has been sold commercially, all have a common trait. Each
utilizes, or is designed to utilize, a multistage interconnection network for
communication among various system components. Thus, study of these
systems provides information on the environment in which a multistage
interconnection  network  will  operate. 

These  machines  also  represent  a  range  of  architectural  ideas  for  parallel 
computers  intended  for  signal  or  image  processing,  pattern  recognition,  and 
related  tasks.  They  thereby  illustrate  the  influence  of  the  application  on  the 
machine  design  for  parallel  computers.  Because  the  motivation  for  parallel 
computation  is  speed,  most  parallel  computers  are  optimized  for  a  particular 
application  domain.  Consequently,  it  is  important  that  parallel  computer 
architects  study  the  intended  applications  of  their  machines.  In  this  way  they 
can build confidence that the finished machine will perform as desired
(typically, meet execution speed requirements).


2.2  PASM 

PASM (partitionable SIMD/MIMD) is a special purpose, dynamically
reconfigurable, large-scale multimicroprocessor system being built at Purdue
University [SMS78, SSK81, SSD84]. Due to the low cost of microprocessors,
computer system designers have been considering various multimicrocomputer
architectures as a way of achieving high performance. PASM combines the
following features: (1) partitionability, allowing operation with many
independent SIMD and/or MIMD machines of various sizes; and (2) a design
guided by a variety of problems in image processing and pattern recognition.

Figure 2.1 is a block diagram of the basic components of PASM. The
heart of PASM is the Parallel Computation Unit (PCU), which contains N = 2^n
processors, N memory modules, and an interconnection network. The PCU
processors are microprocessors that perform the actual SIMD and MIMD
computations. The PCU memory modules are used by the PCU processors for
data storage in SIMD mode and both data and instruction storage in MIMD
mode. Figure 2.2 shows that the processors and memory modules of the PCU
are organized as processing elements. Each memory module consists of a pair
of memory units. This double-buffering scheme allows data to be moved
between one memory unit and secondary storage (the Memory Storage System)
while the processor operates on data in the other memory unit.

The interconnection network provides a means of communication among
the PCU PEs, which are physically numbered (addressed) from 0 to N-1.
PASM will use either an Extra Stage Cube type network [AdS82d] or an
Augmented Data Manipulator type network [AdS80, McS82c, SiM81a]. Both
consist of n = log2 N stages of switches and are controlled by routing tags
[Law75, SiM81a]. Both can be partitioned into independent subnetworks of
varying sizes, which are powers of two, if all of the PEs in a partition of size
P = 2^p have the same value in the low-order n - p bit positions of their
addresses [Sie80]. Studies are currently being conducted to choose one of these
networks to implement in the PASM prototype (e.g., [McS82b]). This work is
relevant to that effort.

Figure 2.1 Block diagram of PASM.

The Micro Controllers (MCs) are a set of Q = 2^q microprocessors,
physically numbered (addressed) from 0 to Q-1, that are the control units for
PCU processors operating in SIMD mode and orchestrate the activities of PCU
processors in MIMD mode. Each MC is attached to a memory module which,
like the PCU memory modules, consists of a pair of memory units so that
memory loading and computations can be overlapped. Control Storage
contains the programs for the MCs.

Each MC controls N/Q PCU processors. The physical addresses of the
N/Q PEs connected to an MC, shown in Figure 2.3, have as their low-order q
bits the physical address of the MC, so that the MCs can support system
partitioning with either an Extra Stage Cube type or Augmented Data
Manipulator type network [SSK81]. Possible values for N and Q are 1024 and
32, respectively. Loading R MC memory modules with the same instructions
simultaneously yields a virtual SIMD machine (partition) of size RN/Q, where
R = 2^r and 0 <= r <= q [SMS78]. In SIMD mode, the R MCs are synchronized and
each MC fetches instructions from its memory module, executing the control flow
instructions (e.g., branches) and broadcasting the data processing instructions
to its PCU PEs. Similarly, a virtual MIMD machine of size RN/Q results from
combining the independent efforts of the PCU processors of R MCs. In both
cases, the physical addresses of these MCs must have the same low-order q-r
bits so that all of the PCU PEs in the partition have the same low-order q-r
physical address bits.
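
The partitioning rule can be illustrated with a short check. This is a minimal sketch in Python, not part of the PASM software; the function name and the representation of a candidate partition as a list of MC physical addresses are assumptions made for illustration. It accepts a group of R = 2^r MCs only if all of their addresses agree in the low-order q-r bits.

```python
def is_valid_mc_group(mc_addresses, q):
    """Return True if the MC physical addresses can form one partition:
    R = 2**r distinct MCs that agree in their low-order q - r bits."""
    group = set(mc_addresses)
    R = len(group)
    if R == 0 or R & (R - 1):          # R must be a power of two
        return False
    if any(a < 0 or a >= (1 << q) for a in group):
        return False                   # addresses must lie in 0 .. Q-1
    r = R.bit_length() - 1
    if r > q:
        return False
    low_mask = (1 << (q - r)) - 1      # selects the low-order q - r bits
    return len({a & low_mask for a in group}) == 1


# Q = 8 (q = 3): MCs 1, 3, 5, 7 agree in their low-order bit.
print(is_valid_mc_group([1, 3, 5, 7], q=3))
# MCs 0 and 1 differ in the low-order two bits they would have to share.
print(is_valid_mc_group([0, 1], q=3))
```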

The  basic  MC  organization  can  be  enhanced  to  allow  the  sharing  of 
memory  modules  by  the  MCs  in  a  partition.  The  MCs  can  be  connected  by  a 
shared reconfigurable ("shortable") bus such as described in [ArP76]. This is
shown  in  Figure  2.4  for  Q=8.  The  MCs  must  be  ordered  on  the  bus  by 
increasing  order  of  the  bit  reverse  of  their  addresses  so  MCs  that  agree  in  their 
low-order  address  bits  can  share  memory  modules.  Figure  2.5  depicts  the  MC 
processors  and  memory  modules  with  the  reconfigurable  bus  for  Q  =  16.  This 
enhanced  MC  connection  scheme  provides  more  program  space  for  jobs  using 
multiple  MCs  and  a  degree  of  fault  tolerance,  since  known-faulty  MC  memory 
modules  can  be  avoided  in  multiple-MC  partitions.  These  advantages  come  at 
the  expense  of  additional  system  complexity.  The  use  of  such  a  reconfigurable 
bus to share memories is also discussed in [KaK70].
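
The bus ordering can be computed directly. The sketch below, in Python, sorts the MC addresses by the bit reverse of each address; the function names are hypothetical. With this ordering, MCs that agree in their low-order address bits occupy adjacent positions on the bus.

```python
def bit_reverse(value, width):
    """Reverse the low-order `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def bus_order(q):
    """List the Q = 2**q MC addresses in increasing order of bit reverse."""
    return sorted(range(1 << q), key=lambda address: bit_reverse(address, q))


# For Q = 8 the even-numbered MCs (low-order bit 0) come first,
# followed by the odd-numbered MCs (low-order bit 1).
print(bus_order(3))  # [0, 4, 2, 6, 1, 5, 3, 7]
```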

Within each partition, the PCU PEs have logical addresses. Given a
virtual  machine  of  size  RN/Q,  the  PEs  have  logical  addresses  0  to  (RN/Q)-1 
(the  high-order  r  +  n-q  bits  of  the  physical  addresses).  Similarly,  the  MCs  are 
assigned  logical  addresses  from  0  to  R-1  (for  R  >  1,  the  high-order  r  bits  of 
the  physical  address).  The  PASM  language  compilers  and  operating  system 
will  translate  between  logical  and  physical  addresses,  so  a  system  user  need 
deal  only  with  logical  addresses. 
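
Because the PEs of a partition share their low-order q-r physical address bits, taking the high-order bits amounts to a right shift. The following Python sketch shows the physical-to-logical direction only; the function names are hypothetical and do not describe the actual PASM compilers or operating system.

```python
def pe_logical_address(physical, q, r):
    """High-order r + n - q bits of an n-bit PE physical address."""
    return physical >> (q - r)


def mc_logical_address(mc_physical, q, r):
    """High-order r bits of a q-bit MC physical address."""
    return mc_physical >> (q - r)


# N = 16, Q = 4 (q = 2), R = 2 (r = 1): the partition formed by MCs 1 and 3
# contains the PEs whose low-order two address bits are 01 or 11.
pes = [p for p in range(16) if p % 4 in (1, 3)]
print([pe_logical_address(p, q=2, r=1) for p in pes])  # [0, 1, ..., 7]
```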

The Memory Storage System provides secondary storage for data (SIMD
mode) or programs and data (MIMD mode). Multiple devices are used to allow
parallel data transfers. The Memory Storage System consists of N/Q
independent Memory Storage Units, numbered from 0 to (N/Q)-1. Each
Memory Storage Unit is connected to Q PCU memory modules. For
0 <= i < N/Q, Memory Storage Unit i is connected to those memory modules
whose physical addresses are of the form (Q * i) + k, 0 <= k < Q. Thus, Memory
Storage Unit i is connected to the i-th PE of each MC as shown in Figure 2.6 for
N = 32 and Q = 4.
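
The connection rule is simple integer arithmetic on the physical address. A minimal Python sketch, with hypothetical function names:

```python
def modules_for_msu(i, Q):
    """Physical addresses of the Q memory modules attached to
    Memory Storage Unit i: (Q * i) + k for 0 <= k < Q."""
    return [Q * i + k for k in range(Q)]


def msu_for_module(physical_address, Q):
    """Memory Storage Unit attached to a given memory module."""
    return physical_address // Q


# N = 32, Q = 4: Memory Storage Unit 3 serves modules 12 through 15,
# one belonging to each of the four MCs (12 % 4 = 0, ..., 15 % 4 = 3).
print(modules_for_msu(3, Q=4))   # [12, 13, 14, 15]
print(msu_for_module(13, Q=4))   # 3
```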

The advantages of this Memory Storage Unit connection scheme are that
for a partition of size N/Q all of the memory modules can be loaded in parallel
and the data is directly available no matter which partition (MC group) is
chosen. This is achieved by storing in Memory Storage Unit i the data for a
task which is to be loaded into the i-th logical memory module of the virtual
machine of size N/Q, 0 <= i < N/Q. Thus, no matter which MC group of N/Q
processors is chosen, the data from the i-th Memory Storage Unit can be loaded
into the i-th logical memory module of the virtual machine, for all i,
0 <= i < N/Q, simultaneously, i.e., in one parallel block transfer. This same
approach can be taken if only (N/Q)/2^d distinct Memory Storage Units are
available, 0 <= d <= n-q; however, 2^d parallel block loads will be required
instead of just one. In general, a task needing RN/Q processors, 1 <= R <= Q,
logically numbered 0 to (RN/Q)-1, will require R parallel block loads if the
data for the memory module whose high-order n-q logical address bits equal i is
loaded into Memory Storage Unit i. This is true no matter which group of R MCs
(which agree in their low-order q-r address bits) is chosen. If only (N/Q)/2^d
distinct Memory Storage Units are available, 0 <= d <= n-q, then R*2^d parallel
block loads will be required instead of just R.
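
The count of R parallel block loads can be made concrete with a small schedule. The Python sketch below assumes all N/Q Memory Storage Units are present (d = 0); the function name and the choice of which module each unit serves in a given step are assumptions made for illustration. Each unit i holds the data for the R logical modules whose high-order n-q address bits equal i, so one module per unit is loaded in each of the R steps.

```python
def load_schedule(n, q, r):
    """Return R = 2**r steps; each step is a list of
    (memory_storage_unit, logical_module) pairs loaded in parallel."""
    R = 1 << r
    units = 1 << (n - q)               # N/Q Memory Storage Units
    schedule = []
    for step in range(R):
        # In this step, unit i serves the logical module whose high-order
        # n - q bits are i and whose low-order r bits equal `step`.
        schedule.append([(i, (i << r) | step) for i in range(units)])
    return schedule


# N = 16 (n = 4), Q = 4 (q = 2), R = 2 (r = 1): two block loads of four
# modules each cover logical modules 0 through 7.
for step in load_schedule(4, 2, 1):
    print(step)
```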

The Memory Management System (MMS) controls the loading and
unloading of the PCU memory modules. It employs a set of four cooperating,
dedicated microprocessors. The Directory Processor (DP) receives requests
from the SCU and the MCs to load or unload PE memory modules. In turn it
generates commands for and coordinates the actions of other MMS processors.
The Memory Scheduling Processor (MSP) receives the commands from the
Directory Processor and determines the order in which they should be
performed. The Command Distribution Processor (CDP) issues these
commands to the MSUs and processes command completion acknowledgements.
The Input/Output Processor (IOP) handles the transfer of files between the
MSUs and peripheral devices. It also coordinates the reformatting and
distribution of files among the MSUs.

This  distributed  processing  approach  is  chosen  in  order  to  provide  the 
Memory  Management  System  with  a  large  amount  of  processing  power  and 
high  speed  (due  to  the  parallelism  possible)  at  low  cost.  A  large  amount  of 
power is required because it is not desirable to burden the SCU with any
memory management tasks. Furthermore, the number of files to be managed is
enormous: a user request for an input file for a 1024-PE SIMD program would
involve  the  management  of  1024  file  directory  lookups  and  transfers.  The 
management  problem  becomes  more  severe  when  multiple  simultaneous  users  of 
PASM  are  considered. 

The  System  Control  Unit  (SCU)  is  a  conventional  machine,  such  as  a 
PDP-11/70,  and  is  responsible  for  the  overall  coordination  of  the  activities  of 
the  Memory  Management  System  and  the  Micro  Controllers.  In  addition,  the 
SCU  is  capable  of  functioning  independently  as  a  serial  processor.  While  the 
rest  of  PASM  executes  a  parallel  computation,  the  SCU  can  handle  such  tasks 
as  program  development  and  job  scheduling.  In  order  to  perform  these 
functions,  the  SCU  must  contain  the  PASM  language  compilers  and  assemblers 
and portions of the PASM operating system, PASMOS [Tuo83].

Figure 2.7 shows a block diagram of the PASM system prototype that is
under  construction.  The  interconnection  network  used  in  the  prototype,  which 
will  be  an  Extra  Stage  Cube  (see  Chapter  3),  is  not  shown  in  this  figure.  The 
prototype  will  include  N  =  16  PEs  and  Q  =  4  MCs.  Thus,  each  MC  will  control 
4  (=  N/Q)  PEs.  The  System  Control  Unit  will  be  a  dedicated  microprocessor. 
Users  will  access  the  PASM  prototype  system  via  a  Purdue  University 
Engineering  Computer  Network  (ECN)  machine  as  shown  in  the  figure. 
Commands  (jobs)  initiated  by  a  user  are  sent  from  the  ECN  machine  on  which 
that user is logged in to the prototype SCU. Response(s) to a user command
(other  than  large  files,  such  as  processed  imagery)  are  returned  by  the  SCU. 

For additional information about various aspects of PASM see:
organization [SMS78, SSK81], instruction set [SMS78], masking schemes for
enabling and disabling PEs [Sie77a, Sie77b, SMS78, SSK81], interconnection
networks [AdS82b, AdS82d, Sie77a, Sie79a, Sie79b, Sie80, SiS78], operating
system [SSK81, TuS82a, TuS82b, TuS83, TuS84a], programming language
[ClS83, MSS80a], memory management system [KSG83, SKW79, SSK81,
TuS83, TuS84b], prototype design and simulation [KSH82], and examples of
possible applications [KST85, MSS80b, SSE80, SSF82, WaS82]. A reading list
of over 50 PASM-related papers is in [SSD84].

2.3  Numerical  Aerodynamic  Simulation  Facility 

The Numerical Aerodynamic Simulation Facility (NASF) [Bur79] is
intended to support execution, in ten minutes or less, of time-averaged
Navier-Stokes computations on steady fluid flow problems involving a million
grid points. Scientists at the National Aeronautics and Space Administration
need this processing power for their research. This requires an average rate of
execution of roughly one billion floating point operations per second. A block
diagram of system hardware is shown in Figure 2.8.

The  Flow  Model  Processor  (FMP)  is  the  core  of  the  system.  Figure  2.9 
shows  the  organization  of  the  FMP.  The  FMP  has  a  structure  akin  to  the  P- 
to-M  model  shown  in  Figure  1.2.  It  has  512  processors  (that  meet  the 
definition  of  PE  given  in  Chapter  1);  a  bidirectional,  circuit-switched, 
multistage interconnection network [BaL81] (called the Connection Network
(CN)) based on the Omega network [Law75]; 521 Extended Memory (EM)
modules;  a  Data  Base  Memory  (DBM);  a  Coordinator  (CR);  and  a  Diagnostic 
Controller  (DC).  There  are  four  on-line  spares  each  of  processors  and  EM 
modules  to  enhance  FMP  reliability.  The  processors  each  contain  a  scalar 
execution  unit  and  storage  for  data  and  programs;  both  integer  and  floating 
point  arithmetic  units  are  included.  The  EM  contains  the  data  common  to  the 
processes  being  independently  evaluated  by  each  of  the  processors.  The  DBM 
is  slower  than  the  EM  and  is  provided  to  hold  the  past,  present,  and  future 
jobs  scheduled  on  the  system. 

The  CR  has  connections  to  both  sides  of  the  network  (24  signals  each) 
which  allow  it  to  access  any  memory  module  or  processor.  There  is  a  four-bit 
command  bus  plus  strobe  connected  to  all  processors  for  synchronization  and 
diagnostic  purposes.  The  CR  has  an  I/O  channel  to  the  host  and  generates 
clock  signals  for  and  receives  error  interrupt  signals  from  the  memory  modules. 
It  can  also  issue  descriptors  to  and  receive  status  from  the  DBM,  and  there  are 
connections  to  the  DC. 

Under normal operating conditions, the switching elements of the CN
provide straight and exchange connections bidirectionally. Under special
command from the CR, a broadcast connection can be established. A path is
created from the CR's connection on the memory module side of the network
to all processors. The CR can also specify that an arbitrary module broadcast
a specified data item to all processors.

Figure 2.9 General organization of the Flow Model Processor [Bur79].

Another  special  connection  that  can  be  initiated  by  the  CR  is  a  direct 
exchange  of  data  by  pairs  of  processors.  This  “wraparound”  connection  allows 
all  pairs  of  processors  whose  addresses  differ  in  the  i-th  bit  to  swap  data  items. 
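
The pairing implied by this connection can be expressed with an exclusive-OR on the processor address. This is a minimal Python sketch; the function name is hypothetical, and numbering bit 0 as the least significant bit is an assumption, since the text does not state the bit-numbering convention.

```python
def exchange_pairs(num_processors, i):
    """Pairs of processor addresses that differ only in bit i
    (bit 0 taken as least significant, an assumption here)."""
    bit = 1 << i
    return [(a, a ^ bit) for a in range(num_processors) if not a & bit]


# With eight processors and i = 1, addresses 0 and 2 swap data,
# as do 1 and 3, 4 and 6, and 5 and 7.
print(exchange_pairs(8, 1))  # [(0, 2), (1, 3), (4, 6), (5, 7)]
```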

The  CR  is  also  responsible  for  program  initiation  and  diagnostics.  It  has 
access  via  the  network  to  every  processor  register  and,  generally,  can  exercise 
any  intraprocessor  activity.  Program  loading,  transfers  between  disk  and 
DBM,  and  some  diagnostics  are  handled  by  the  host  computer.  The  DC  is  an 
interface  between  the  host  computer  and  the  CR;  it  allows  the  host  to  diagnose 
the  CR. 

All  connections  established  in  the  CN  are  strictly  circuit  switched.  Two 
options  were,  however,  considered  for  establishing  circuits.  The  first  method 
provides  two  registers  in  each  node  for  latching  routing  tag  and  memory 
address information. This makes establishing a circuit very fast since the
interface  does  not  have  to  wait  until  the  circuit  is  complete  before  transmitting 
the  memory  address.  The  second  method,  tentatively  chosen  because  of  its 
lower  cost,  includes  no  such  registers.  The  interface  must  hold  the  routing  tag 
on  the  data  path  until  the  circuit  is  complete.  Due  to  the  anticipated  use  of 
the  system  in  which  all  processors  participate  in  computations,  there  is  no 
attempt to take advantage of the partitioning abilities of the network [Sie80].
Partitionability  is  considered  only  to  distribute  sections  of  the  network 
advantageously  among  equipment  cabinets. 


The  Support  Processing  System  serves  as  the  central  control,  interfaces 
with  users  and  peripherals,  maintains  data  files,  and  provides  computational 
support necessary to keep FMP utilization high. It consists of three
subsystems: the Support Processor, the File System, and the Peripheral
Support  System  (see  Figure  2.8). 

The Support Processor, which acts as a system host processor, is a
dual-processor Burroughs B7800. The planned configuration includes redundancy in
most elements of the B7800 for high availability. Since the Support Processor
is the master control for the NASF, most user communication is
supported  by  it.  Up  to  96  input  lines  are  envisioned. 

Study  of  output  device  requirements  indicated  that  some  anticipated 
peripherals  would  need  a  significant  amount  of  computational  support.  The 
Computer Output to Microfilm (COM) device is the most demanding of these.
NASA postulates a COM load in excess of 10,000 frames per day, with output
assumed to be graphic images. The majority of this load is for "movies" of
simulation  results.  The  Peripheral  Support  System  is  configured  with  two 
high-end  minicomputers  with  special-purpose  software  for  handling  these  tasks. 

The volume of data and programs to be moved in and out of the FMP
together with the amount of file management required for the total system
indicate strongly that a separate system be provided for this purpose, rather
than using the Support Processor (B7800) itself. Also, with file management
functions in a processor separate from processors executing user programs,
security capability is enhanced. The File System includes the disk packs, the
archival store, and the file manager. It performs directory management and
storage allocation functions and is responsible for data ownership and access
controls. The DBM may also be considered part of the File System because it
is a staging area for FMP programs and data.


Achieving adequate fault tolerance is an important design issue for NASF.
Many features are built into the system to provide fault tolerance. First, each
48-bit data word is accompanied by a 7-bit
single-error-correcting/double-error-detecting (SECDED) code throughout the
system. To maintain the long term integrity of the data stored in the DBM, each
data item is periodically "scrubbed." That is, all single bit errors are
corrected often enough to reduce drastically the chance of a double error.
Second, whenever a memory request is made, the selected memory module compares
the incoming routing tag to its own address, detecting misrouted packets. Any
discrepancy is reported as an error by sending an interrupt to the CR. Third,
there are four spare processor-memory pairs and four spare memory modules, all
on-line. If diagnostic tests detect a faulty processor, for example, a new
processor-memory pair can be automatically switched into its place. The switch
is accomplished by performing an address transformation in hardware. Finally,
automatic recovery of the file system is to be implemented in software. All
errors of any kind are recorded and reported.
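
The 7-bit figure for a 48-bit word is consistent with the standard Hamming bound for SECDED codes. The Python sketch below computes the smallest number of check bits; it assumes an extended Hamming construction (single-error correction needs r bits with 2^r >= m + r + 1 for m data bits, plus one overall parity bit for double-error detection). The source does not state which SECDED code NASF uses, and the function name is hypothetical.

```python
def secded_check_bits(data_bits):
    """Minimum check bits for an extended Hamming SECDED code:
    the smallest r with 2**r >= data_bits + r + 1, plus one parity bit."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


print(secded_check_bits(48))  # 7, matching the 7-bit code on each 48-bit word
```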

Effort  has  been  made  toward  developing  a  language  for  NASF  called  FMP 
FORTRAN. FMP FORTRAN is based on American National Standards
Institute (ANSI) FORTRAN 77 [ANS78] with extensions and modifications to
improve  its  utility  for  the  planned  applications  and  to  allow  more  efficient  use 
of  FMP  hardware.  The  additional  language  constructs  provide  means  of 
describing  both  spatial  (geometry  and  state)  and  temporal  (processes) 
relationships  in  a  simulation  model.  The  additional  language  constructs  are  as 
follows. 


1.  DOMAIN:  Used  to  define  all  those  discrete  points  in  the  simulation 
model  grid  at  which  state  information  will  exist  and/or  processing  will 
occur  (similar  to  array  dimensioning). 

2.  REGION:  Allows  creation  of  a  virtual  domain  of  interest  through 
dynamic  selection  of  elements  of  a  DOMAIN. 

3.  INALL:  Provides  for  declaration  of  variables  associated  with  grid  points 
in  a  simulation  model  without  using  subscripts. 

4. DOALL ... USING / ENDDO ... GIVING: Used to express the inherent concurrency in a
program  loop.  The  USING  and  GIVING  portions  of  the  construct 
explicitly  define  data  dependency  of  the  simulation  model  so  that  the 
compiler  can  anticipate  requests  for  data  transfer  between  the  EM  and 
processors. 

5.  SUMALL:  Instructs  a  processor  to  sum  all  instances  of  an  indicated 
variable  stored  in  that  processor’s  associated  memory. 

6.  LOCATION:  Returns  the  instance  number  of  the  successful  instance  of 
the  most  recent  execution  of  MAXALL  (or  MINALL),  which  finds  the 
maximum  (or  minimum)  value  in  a  data  structure. 

7.  RECURRENCE:  Uses  recursive  doubling  to  solve  recurrence  relations 
on  one-dimensional  domains. 
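
Recursive doubling, named in item 7, can be illustrated on the simplest linear recurrence, x(i) = x(i-1) + a(i) with x(0) = a(0). The following Python sketch is a generic illustration of the technique, not FMP FORTRAN and not a description of how RECURRENCE is implemented; the function name is hypothetical. Each pass combines elements a fixed stride apart and then doubles the stride, so roughly log2 of the domain length passes are needed, and all updates within a pass are independent and could be performed by separate processors.

```python
def recursive_doubling_sum(a):
    """Solve x[i] = x[i-1] + a[i], x[0] = a[0], by recursive doubling."""
    x = list(a)
    stride = 1
    while stride < len(x):
        # Every element of the new list depends only on the old list,
        # so the updates in one pass can proceed concurrently.
        x = [x[i] + x[i - stride] if i >= stride else x[i]
             for i in range(len(x))]
        stride *= 2
    return x


print(recursive_doubling_sum([1, 2, 3, 4, 5]))  # [1, 3, 6, 10, 15]
```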

During  normal  NASF  operation,  all  data  and  program  code  for  a  task  is 
first  loaded  into  the  DBM.  DBM  loading  is  scheduled  by  the  Support  Processor 
via  the  File  System.  The  Support  Processor  scheduler  initiates  a  task  on  the 
FMP through interaction with the coordinator (CR). When task execution is
to begin, software in the CR starts transfer of code files from the DBM to the
EM.  From  there,  the  CR  causes  its  code  files  to  be  loaded  into  its  memory  and 
causes  processor  code  files  to  be  broadcast  to  each  processor.  The  initialization 
phase  of  the  program  (now  in  the  CR)  then  transfers  necessary  data  to  the  EM. 
With  data  in  place  in  the  EM  and  code  files  in  place  in  the  CR  and  processors, 
task  execution  starts.  While  task  execution  is  in  progress,  the  CR  serves  as  a 
high-level  “instruction  sequencer”;  processor  tasks  are  explicitly  initiated,  and 
when  all  processors  complete  their  tasks  the  CR  initiates  the  next  task. 

2.4 Texas Reconfigurable Array Computer

The Texas Reconfigurable Array Computer (TRAC) is a microprocessor-based
partitionable SIMD/MIMD system [LiT77, SUK80]. A prototype has
been  built  at  the  University  of  Texas  at  Austin.  The  important  attributes  of 
the  design  include  memory  space  sharing,  reconfigurability,  varistructuring, 
inter-task  communication  ability,  and  its  appearance  as  a  virtual  machine  to 
the  user.  Memory  space  sharing  means  independent  or  interacting  tasks  can 
run  simultaneously.  TRAC  can  dynamically  change  existing  partitions  of 
processors and other resources, making it reconfigurable. Varistructuring refers
to  the  fact  that  processors  can  be  grouped  to  process  various  data  widths.  The 
system  is  virtual  in  that  programs  need  not  know  on  what  set  of  processors 
they  are  executing.  Figure  2.10  gives  a  block  diagram  of  TRAC. 

TRAC processors operate on eight-bit data, but can be linked together for
multi-byte operations [KPL80]. Each contains a microcoded Register
Arithmetic Logic Unit (RALU), several control units, a packet buffer, control
storage and a sequencer, and status registers. The RALU uses two 2901
bit-sliced, microprogrammed processors. The status registers store substantial
information concerning system operational and structural condition.

An  SW-banyan  multistage,  bidirectional  interconnection  network  links  the 
processors  to  the  memory  modules  and  I/O  ports  (MIOs)  [PKM80].  The 
network  consists  of  apex,  base,  and  intermediate  nodes.  Processors  are 
attached  to  the  apex  nodes,  while  MIOs  are  connected  to  base  nodes.  The 
intermediate  nodes  represent  switching  elements.  The  network  as  implemented 
is capable of three tree-structured types of connections: data trees, instruction
trees, and shared memory trees. A data tree links a single processor to several
MIOs for Single Instruction Stream - Single Data Stream (SISD) mode
operation.  An  instruction  tree  connects  several  processors  to  a  base  node  of 
the  network,  making  SIMD  execution  possible.  A  shared  memory  tree  is 
similar  to  an  instruction  tree  except  that  any  one  of  the  processors  on  a  shared 
tree  can  exclusively  acquire  the  shared  MIO  of  the  tree.  An  explicit  release  of 
the  MIO  is  required  before  other  processors  of  the  tree  have  the  opportunity  to 
establish  control  of  the  MIO.  There  is  also  a  controller  for  the  network  used  to 
create  the  trees. 

Embedded  in  any  instruction  or  shared  memory  tree  is  a  General  Purpose 
Communication  (GPC)  link.  The  GPC  link  provides  a  high  speed,  bit  serial, 
bidirectional  communication  path  among  the  group  of  processors  associated 
with  an  instruction  or  shared  memory  tree.  For  instruction  trees  the  GPC 
paths  might  be  used  for  carry  propagation  in  high  precision  arithmetic.  Also, 
conflicting  memory  requests  in  shared  memory  trees  can  be  arbitrated  through 
use  of  GPC  links. 

Data  transfer  in  the  trees  is  via  circuit  switching.  However,  the  network 
also  supports  packet  switching  in  the  background.  The  MIOs  are  synchronized, 
so during the memory access cycle the links of the network are idle and, thus,
free to forward packets. Packets are sent by explicit instruction or as an
implicit part of some processor microinstruction.

Secondary storage, called the Self-Managing Secondary Memory (SMSM),
is  attached  to  one  I/O  port.  It  is  capable  of  storing  variable  length  records 
along  with  a  label;  SMSM  hardware  can  search  for  records  by  their  labels. 
Other  I/O  ports  can  support  various  peripherals  including  terminals. 

2.5 STARAN

STARAN  is  a  system  built  by  Goodyear  Aerospace  Corporation,  designed 
to  combine  the  attributes  of  parallelism,  associativity,  and  low  cost  [Bat74]. 
An  associative  processor  is  an  SIMD  machine  in  which  data  can  be  retrieved 
using  the  associative  concept,  i.e.,  by  contents  rather  than  by  physical  location 
[Bae80]. Low cost was achieved by using off-the-shelf components. A more
recent version, the STARAN Series E, is essentially the same system, but with
more memory and implemented with some higher speed circuitry [Bat77b].
Applications  envisioned  for  STARAN  by  the  designers  included  Fast  Fourier 
Transform  (FFT)  operations,  sonar  post-processing  involving  FFTs,  high-speed 
character  string  searching,  file  (data  base)  processing,  and  air  traffic  control. 
Figure  2.11  shows  a  block  diagram  of  a  typical  STARAN  system. 

A  host  computer  is  not  required  for  STARAN  operation.  The  port  is 
provided  so  that  STARAN  can  be  operated  as  a  special-purpose  peripheral  with 
respect to a host. STARAN hosts have included a Honeywell HIS-645
connected via an I/O channel and an XDS Σ5 computer interfaced through a
direct memory access port.

The AP Control exercises control over its attached Array Modules; all
arrays  follow  the  same  instruction  stream.  It  fetches  instructions  from  the 
Control  Memory,  which  holds  the  AP  Control  programs,  PIO  control  programs, 
and  microprogram  subroutines.  AP  Control  includes  registers  to  hold  the 
current  instruction;  a  program  status  word;  a  comparand  word,  operand  for  the 
array  modules,  or  result  from  the  array  modules;  the  active/inactive  select  for 
each  array  module;  four  pointers  for  use  in  array  module  memory  access;  three 
loop  counters;  and  two  array  module  memory  access  mode  words. 

The Parallel Input/Output (PIO) is used for high bandwidth I/O and data
transfers  between  array  modules.  The  PIO  Control  Unit  controls  the  PIO  Flip 
Network  and  the  attached  array  modules.  It  receives  data  and  instructions 
from  the  control  memory.  The  PIO  flip  network  switches  data  among  eight 
256-bit ports. Ports 0 through 3 connect to the four array modules. Port 7
connects  to  a  bus  in  the  PIO  Control.  Ports  4,  5,  and  6  are  spares  and  can  be 
used  for  high  bandwidth  peripherals  or  additional  array  modules.  The  PIO 
Control  Unit  transfers  data  to  and  from  array  modules  not  being  used  by  AP 
Control. 

A  Digital  Equipment  Corporation  (DEC)  PDP-11  minicomputer  is  used  for 
the  Sequential  Control,  performing  peripheral  handling,  system  console,  and 
diagnostic  functions.  Sequential  Control  can  access  any  part  of  control 
memory.  Synchronization  of  Sequential  Control,  AP  Control,  and  PIO  Control 
is accomplished by the External Function (EFX) logic. Control units issue
commands  to  EFX  logic  to  cause  system  actions  and  read  system  states. 

The  array  modules  perform  the  data  processing  for  STARAN. 
Figure  2.12  is  a  diagram  of  an  array  module.  A  key  component  of  an  array 
module is the multidimensional access (MDA) memory. The MDA memory
stores $2^{16}$ = 64K bits of information, implemented as 256 words of 256 bits. In
STARAN Series E, an enhanced version of STARAN, the maximum MDA
storage in each array module was expanded to $2^{24}$ = 16M bits [Bat77b].
Reading and writing is controlled by specifying the shape and position of an
access stencil [Bat77a]. An eight-bit access mode word, M, specifies stencil
shape, and G, an eight-bit global address word, determines stencil position.
Every stencil refers to exactly $2^{8}$ = 256 bits of storage. The specified bits can
be  fetched  or  stored  in  one  memory  cycle. 

A  flip  network  is  also  used  in  each  array  module  to  provide 
communications  between  the  MDA  memory  and  the  256  one-bit  processing 
elements  of  the  module,  and  between  processing  elements.  The  network 
operates  in  circuit  switched  mode.  The  Selector  determines  the  parties  to  an 
information  transaction.  Given  the  structure  formed  with  the  aid  of  the 
Selector,  STARAN  can  be  considered  both  a  PE-to-PE  and  P-to-M  parallel 
processor  (even  though  the  network  is  unidirectional). 

In  an  array  module  there  are  three  256-bit  registers  (M,  X,  and  Y)  that 
perform  bit-oriented  operations,  one  bit  associated  with  each  processor. 
Register  M  drives  the  write  mask  bus  of  the  MDA  memory  to  select  which  of 
the  MDA  memory  bits  are  modified  in  a  masked-write  operation.  Registers  X 
and Y are used for data processing. If $f_i$ is the $i$th output of the flip network, $x_i$
and $y_i$ are the $i$th bits of X and Y, respectively, and $\phi$ denotes any Boolean
function, then an array module can perform the following operations.

1. $x_i \leftarrow \phi(x_i, f_i)$.

2. $y_i \leftarrow \phi(y_i, f_i)$.

3. Operations (1) and (2) simultaneously using the same Boolean function.

4. Operation (1) only when $y_i = 1$, else $x_i$ unchanged.

5. Operation (4) with the current $y_i$, plus operation (2) simultaneously
producing a new $y_i$.
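
The five register operations listed above can be sketched in a few lines of Python. This is only an illustrative model, not STARAN microcode: the function names (op1 through op5), the use of Python lists for the X, Y, and flip-network output vectors, and the choice of exclusive-or as the sample Boolean function are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

Bits = List[int]                     # one 0/1 entry per processing element
BoolFn = Callable[[int, int], int]   # phi(register_bit, flip_output_bit) -> 0 or 1


def op1(x: Bits, f: Bits, phi: BoolFn) -> Bits:
    """Operation 1: x_i <- phi(x_i, f_i) for every i."""
    return [phi(xi, fi) for xi, fi in zip(x, f)]


def op2(y: Bits, f: Bits, phi: BoolFn) -> Bits:
    """Operation 2: y_i <- phi(y_i, f_i) for every i."""
    return [phi(yi, fi) for yi, fi in zip(y, f)]


def op3(x: Bits, y: Bits, f: Bits, phi: BoolFn) -> Tuple[Bits, Bits]:
    """Operation 3: operations 1 and 2 together, same Boolean function."""
    return op1(x, f, phi), op2(y, f, phi)


def op4(x: Bits, y: Bits, f: Bits, phi: BoolFn) -> Bits:
    """Operation 4: update x_i only where y_i = 1; otherwise keep x_i."""
    return [phi(xi, fi) if yi == 1 else xi for xi, yi, fi in zip(x, y, f)]


def op5(x: Bits, y: Bits, f: Bits, phi: BoolFn) -> Tuple[Bits, Bits]:
    """Operation 5: operation 4 using the current y, plus a new y from operation 2."""
    return op4(x, y, f, phi), op2(y, f, phi)


def xor(a: int, b: int) -> int:
    """Sample Boolean function (an arbitrary choice for illustration)."""
    return a ^ b
```

In operation 5 the masking uses the old Y bits, so the new X is computed before Y is overwritten; the sketch gets this for free because both results are built from the unmodified inputs.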

There  is  a  resolver  connected  to  the  Y  register.  The  resolver  reads  the 
state of the Y register, and if any bit of Y is set, the activity-or output of the
resolver  is  set.  If  some  Y  bits  are  set,  the  address  output  lines  of  the  resolver 
indicate  the  bit  position  of  the  first  set  bit.  The  result  of  an  associative  search 
of  the  MDA  is  stored  in  the  Y  register.  The  resolver  thus  provides  a 
SOME/NONE  match  flag  and  information  to  implement  the  SELECT  FIRST 
RESPONDER function [Fos76].
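
As a rough illustration of the resolver behavior just described, the following Python sketch returns the activity-or flag together with the position of the first set bit. The function name `resolve` is hypothetical, and treating index 0 as the "first" bit position is an assumption; the text does not say which end of the Y register has priority.

```python
from typing import List, Optional, Tuple


def resolve(y: List[int]) -> Tuple[bool, Optional[int]]:
    """Model of the resolver attached to the Y register.

    Returns (activity_or, first_responder):
      activity_or     -- True if any bit of y is set (SOME/NONE flag)
      first_responder -- index of the first set bit, or None if no bit is set
    Index 0 is assumed to be the highest-priority position.
    """
    for position, bit in enumerate(y):
        if bit:
            return True, position
    return False, None
```

For example, after an associative search that leaves Y = [0, 0, 1, 1], `resolve` reports a match and selects position 2 as the first responder.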

2.6 PUMPS

PUMPS is a system being designed at Purdue University for pattern and
image analysis [BFH82]. An underlying concept used in designing PUMPS is
that  cost-effective  solutions  to  many  applications  such  as  pattern  analysis 
require  some  form  of  functional  specialization  in  the  computer  architecture. 
Figure  2.13  depicts  the  architecture  of  PUMPS. 

Functional  specialization  is  expressed  in  PUMPS  on  a  system  level  by  its 
three  major  subsystems:  man-machine  interface,  image  analysis,  and  data 
base  management.  There  are  P  task  processing  units  (TPUs),  each  of  which  is 
multiprogrammed.  They  can  operate  in  an  interactive  fashion  through  the 
shared  peripheral  processors  and  VLSI  units  (PPVUs)  and  shared  memory 
(SM)  system.  The  special  resource  arbitration  network  (SPAN)  can  link  a 
TPU to a chosen PPVU or provide a path between PPVUs. It is a
low-conflict multistage interconnection network. Allocation of PPVUs to the TPUs
is  dynamic  and  depends  on  the  needs  of  the  currently  executing  task.  The 
specific  functions  provided  by  the  PPVUs  can  be  varied  to  suit  application 
requirements.  Thus,  the  PPVUs  realize  functional  specialization  in  PUMPS  on 
a  more  local  level.  The  PPVUs  together  are  the  heart  of  the  image  analysis 
subsystem  of  PUMPS.  The  TPUs  pass  information  among  themselves  using 
the  task  processor  communications  (TPC)  bus.  Such  communication  is 
envisioned  as  consisting  mainly  of  interrupts,  synchronization  signals,  and 
control signals. TPUs initiate PPVU tasks, execute purely sequential tasks,
and  participate  in  both  MIMD  and  operating  system  processes. 

Within  a  TPU  there  is  a  task  processor  (TP),  a  task  cache  (TC),  a  task 
memory  (TM),  a  task  memory  management  unit  (TMMU),  and  a  resource 
controller  with  data  channels  (RCDC).  The  TP,  TC,  TM,  and  TMMU  together 
form  a  conventional  serial  processor.  When  there  is  a  TP  data  request 
resulting  in  a  TC  miss,  the  TMMU  fetches  the  data  from  the  TM  if  it  resides 
there.  If  a  TM  page  fault  results,  the  block-transfer  oriented  processor-memory 
interconnection  network  (PMIN)  links  the  TPU  with  the  shared  memory  to 
service  the  fault. 

The  shared  memory  is  connected  to  the  file  memories  (FM)  via  the 
backend image database management network (BDMN). This network is
designed to handle data transfer from the multiple disks comprising the FM.
The database subsystem of PUMPS, mentioned earlier, consists of the FM,
BDMN, and a backend computer (not shown in Figure 2.13) connected to the
BDMN.

The front-end communications processor (FECP) provides the man-machine
interface of PUMPS. It performs such tasks as editing and terminal
I/O.

2.7  Conclusions 

This chapter has overviewed the PASM, NASF, TRAC, STARAN, and
PUMPS parallel processing systems. These machines utilize multistage
interconnection networks and, so, could potentially employ the fault-tolerant
network,  or  concepts  from  it,  that  is  defined  and  studied  in  chapters  that 
follow.  Further,  these  systems  illustrate  that  the  specific  applications  to  be 
performed  can  profitably  be  taken  into  account  when  specifying  a  parallel 
computer architecture. With this motivation, a subsequent chapter
investigates an image processing task and uncovers what it would require of a
parallel computer architecture and its interconnection network.

CHAPTER 3

DEFINITION  OF  THE  EXTRA  STAGE  CUBE 
INTERCONNECTION  NETWORK 


3.1  Introduction 

The Extra Stage Cube (ESC) [AdS82a, AdS82c, AdS82d, AdS82e, AdS84a]
is a fault-tolerant interconnection network intended for use in large-scale
SIMD, MSIMD, MIMD, and partitionable SIMD/MIMD computer systems. The
ESC is derived from the Generalized Cube interconnection network [SiS78,
SiM81b]. It consists of a Generalized Cube network with one additional stage
of  switching  elements  at  the  input  and  additional  hardware  to  allow  the 
bypass,  when  desired,  of  this  extra  stage  or  the  output  stage. 

Multistage cube-type networks such as the baseline [WuF80], delta
[Pat81], Generalized Cube [SiS78], indirect binary n-cube [Pea77], omega
[Law75], STARAN flip [Bat76], and SW-banyan (S=F=2, L=n) [GoL73] have
been proposed for use in parallel computer systems. These systems include PASM
[SSK81], the Flow Model Processor of the Numerical Aerodynamic Simulator
[Bur70], TRAC [LiT77, SUK80], STARAN [Bat74], PUMPS [BFH82], the
Ballistic Missile Defense Agency distributed processing test bed [McW78],
Ultracomputer [GGK83], and data flow machines [DBL80]. The ESC can be
used in any of these systems to provide fault tolerance in addition to the usual
cube-type network communication capability.

The ESC can be used in various ways in different computer systems. For
example,  consider  how  the  ESC  could  be  incorporated  in  the  PASM  and 
PUMPS  systems.  In  the  context  of  PASM  (see  Figure  2.2),  the  network  would 
operate in a unidirectional, PE-to-PE packet-switched mode [SiM81b]. The
PUMPS MIMD architecture (see Figure 2.13) [BFH82] consists of multiple
processors  with  local  memories  which  share  special  purpose  peripheral 
processors,  VLSI  functional  units,  and  a  common  main  memory.  The  network 
would  serve  in  a  bidirectional,  circuit-switched  environment  for  this 
architecture,  connecting  local  memories  to  the  common  main  memory. 


3.2  Definition  of  the  Generalized  Cube  Network 

The  Generalized  Cube  network  is  a  multistage  cube-type  network  topology 
which  was  presented  in  [SiS78].  This  network  has  N  input  ports  and  N  output 
ports, where $N = 2^n$. Figure 3.1 shows it for N=8. The network ports are
numbered from 0 to N-1. Input and output ports are network interfaces to
external devices, called sources and destinations, respectively, and have
addresses corresponding to their port numbers. The Generalized Cube topology
has $n = \log_2 N$ stages, where each stage consists of a set of N links connected
to N/2 switches. Thus, the Generalized Cube has O(N log N) physical
components.  Note  that  neither  a  network  input  port  nor  output  port  is 
considered  a  link. 

Each  switch  is  an  interchange  box,  a  2-input/2-output  device,  and  is 
individually  controlled.  An  interchange  box  can  be  set  to  one  of  four  states. 
Assign  the  label  i  to  the  upper  input  and  output  ports  of  a  box,  and  the  label  j 
to the lower input and output ports. The four functional states are: 1) straight
- input i to output i, input j to output j; 2) exchange - input i to output j,
input j to output i; 3) lower broadcast - input j to outputs i and j; and 4) upper
broadcast - input i to outputs i and j [Law75]. Figure 3.1 shows these four
states.
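
The four interchange box states can be modeled directly. The sketch below is a minimal illustration under assumed names (`BoxState`, `interchange_box`); it takes the data present on the upper input (label i) and lower input (label j) and returns what appears on the upper and lower outputs.

```python
from enum import Enum
from typing import Tuple, TypeVar

T = TypeVar("T")


class BoxState(Enum):
    STRAIGHT = 1          # input i -> output i, input j -> output j
    EXCHANGE = 2          # input i -> output j, input j -> output i
    LOWER_BROADCAST = 3   # input j -> outputs i and j
    UPPER_BROADCAST = 4   # input i -> outputs i and j


def interchange_box(state: BoxState, upper_in: T, lower_in: T) -> Tuple[T, T]:
    """Return (upper_out, lower_out) for a 2-input/2-output interchange box."""
    if state is BoxState.STRAIGHT:
        return upper_in, lower_in
    if state is BoxState.EXCHANGE:
        return lower_in, upper_in
    if state is BoxState.LOWER_BROADCAST:
        return lower_in, lower_in
    if state is BoxState.UPPER_BROADCAST:
        return upper_in, upper_in
    raise ValueError(f"unknown box state: {state!r}")
```

Note that the two broadcast states discard one of the inputs, so only the straight and exchange states are permutations of the box inputs.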

All  Generalized  Cube  links  and  interchange  boxes  are  assigned  labels.  The 
input  ports  of  the  boxes  of  the  input  stage  are  labeled  by  the  number  of  the 
network  input  port  to  which  they  are  connected.  In  all  stages  the  upper  output 
of  a  box  has  the  same  label  as  the  upper  input  of  that  box;  the  same 
relationship  holds  true  for  the  lower  box  input  and  output  labels.  Links  take 
the  label  of  the  box  output  to  which  they  are  connected.  Links  give  their  label 
to  the  box  inputs  to  which  they  attach.  This  is  shown  in  Figure  3.1. 

An  interconnection  network  in  an  SIMD  environment  can  be  described  as 
a  set  of  interconnection  functions,  where  each  is  a  permutation  (bijection)  on 
the  set  of  PE  addresses,  or  network  input/output  (I/O)  port  labels  [Sie77a]. 
When  interconnection  function  f  is  applied,  network  input  S  is  connected  to 
output f(S) = D for all S, 0 ≤ S < N, simultaneously. That is, saying that the
interconnection  function  maps  the  source  address  S  to  the  destination  address 
D  is  equivalent  to  saying  the  interconnection  function  causes  data  sent  on  the 
input  port  with  address  S  to  be  routed  to  the  output  port  with  address  D. 
SIMD  systems  typically  route  data  simultaneously  from  each  network  input  via 
a  sequence  of  interconnection  functions  to  each  output.  For  MIMD  systems, 
communication  from  one  source  is  typically  independent  of  other  sources.  In 
this  situation  the  interconnection  function  is  viewed  as  being  applied  to  the 
single  source,  rather  than  all  sources. 

The connections in the Generalized Cube are based on the cube
interconnection functions [Sie77a]. Let $p_{n-1} \ldots p_1 p_0$ be the binary representation
of P, $0 \le P < N$. Then the n cube interconnection functions are defined as

$\mathrm{cube}_i(p_{n-1} \ldots p_1 p_0) = p_{n-1} \ldots p_{i+1} \bar{p}_i p_{i-1} \ldots p_1 p_0$

where $0 \le i < n$, $0 \le P < N$, and $\bar{p}_i$ denotes the complement of $p_i$. This means that
the cube interconnection function maps P to $\mathrm{cube}_i(P)$, where $\mathrm{cube}_i(P)$ is a label
that differs from the label P in just the $i$th bit position. Stage i of the
Generalized Cube topology contains the $\mathrm{cube}_i$ interconnection function. The
input labels of each stage i box differ in the $i$th bit position; hence, the
exchange setting will implement $\mathrm{cube}_i$ in mapping the box input labels to the
box outputs. Thus, data items input to that interchange box are transferred as
specified by the $\mathrm{cube}_i$ interconnection function. When set to straight, data
items input are transferred according to the identity function, where
$\mathrm{identity}(p_{n-1} \ldots p_0) = p_{n-1} \ldots p_0$. Since each interchange box is individually
controlled, each stage i may perform the $\mathrm{cube}_i$ interconnection function on
some subset of the data items depending on the settings of the interchange
boxes. Stage i is the only stage which can map a source to a destination with
an address different from the source address in the $i$th bit position.
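
Because the cube interconnection function simply complements one address bit, it reduces to an exclusive-or with a one-bit mask. The following Python sketch shows this, along with the pairing of labels that share an interchange box in a given stage. The names `cube` and `stage_boxes` are illustrative, and listing the label with bit i equal to 0 first in each pair is an assumption made for the example.

```python
from typing import List, Tuple


def cube(i: int, p: int, n: int) -> int:
    """cube_i(P): complement bit i of the n-bit address P."""
    if not 0 <= i < n:
        raise ValueError("bit position i must satisfy 0 <= i < n")
    if not 0 <= p < (1 << n):
        raise ValueError("address P must satisfy 0 <= P < 2**n")
    return p ^ (1 << i)


def stage_boxes(i: int, n: int) -> List[Tuple[int, int]]:
    """Label pairs entering the same stage-i box: labels differing only in bit i.

    Each pair is listed with the label whose bit i is 0 first (an assumption).
    """
    return [(p, cube(i, p, n)) for p in range(1 << n) if not p & (1 << i)]
```

For N = 8 (n = 3), `stage_boxes(0, 3)` pairs labels 0-1, 2-3, 4-5, and 6-7, while `stage_boxes(2, 3)` pairs 0-4, 1-5, 2-6, and 3-7. Applying `cube` twice with the same i returns the original address, which is consistent with the exchange setting being its own inverse.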

This  network  is  called  the  Generalized  Cube  by  virtue  of  a  geometric 
relationship that can be established among the network port addresses.
Network port addresses can be assigned to the corners of an n-dimensional
cube in such a way that all address pairs of the form {P, $\mathrm{cube}_i(P)$} are on
adjacent corners. Further, for a given i all edges defined by {P, $\mathrm{cube}_i(P)$} are
parallel. Data transfer in the Generalized Cube can be visualized as movement
along hypercube edges in a fixed sequence of dimensions. Figure 3.2 shows a
three-dimensional cube with the vertices labeled in base 2 to correspond to the
Generalized Cube for N=8. The edges of the cube correspond to the three
cube functions for a network of this size. Vertically oriented edges correspond
to the $\mathrm{cube}_2$ function, as they connect vertices with labels differing in the high
order, or second, bit position. Diagonal edges represent the $\mathrm{cube}_1$ function, and
horizontal edges the $\mathrm{cube}_0$ function.

There  is  a  class  of  cube-type  networks  of  which  the  Generalized  Cube  is 
representative. By combining the results of [Sie79a, SiS78, FeW81, WuF80] it
can be seen that all of the following networks are topologically equivalent: the
Generalized Cube [SiS78], the STARAN flip network [Bat76], the omega
network [Law75], and the indirect binary n-cube network [Pea77]. (The
SW-banyan (S=F=2, L=n) is defined as a graph [GoL73] and has the same
topology as a multistage cube [WuF80].) For this reason the Generalized Cube
can  be  used  as  a  standard  for  comparing  cube-type  networks  with  other 
interconnection  networks.  In  this  case  the  comparison  will  be  with  the  Extra 
Stage  Cube. 

More  information  on  the  Generalized  Cube  can  be  found  in  [SiM81b, 
Sie85].  Implementation  details  are  discussed  in  [McS80]. 

3.3  Properties  of  the  Generalized  Cube  Network 

The  Generalized  Cube  has  been  shown  to  have  many  useful  properties, 
summarized  below.  This  listing  is  provided  as  background  for  considering  the 
ESC  network.  The  ESC  has  all  of  the  properties  of  the  Generalized  Cube 
network  in  addition  to  certain  other  capabilities  discussed  in  Chapter  4. 

The  Generalized  Cube  can 

1.  handle  up  to  N  simultaneous  transfers  of  information; 

2.  be  controlled  in  a  distributed  fashion  using  routing  tags; 


3.  be  partitioned  into  independent  subnetworks; 

4.  perform  broadcasting  of  information  from  one  device  to  all  or  a  subset  of 
devices  using  the  network; 

5.  function  in  SIMD,  MSIMD,  MIMD,  and  partitionable  SIMD/MIMD 
modes  of  parallel  processing;  and 

6.  be  implemented  in  numerous  ways. 

3.4  Definition  of  the  Extra  Stage  Cube  Network 

The  ESC  is  formed  from  the  Generalized  Cube  by  adding  an  extra  stage 
along  with  a  number  of  multiplexers  and  demultiplexers.  Figure  3.3  illustrates 
ESC  network  structure  for  N  =  8.  The  extra  stage,  stage  n,  is  placed  on  the 
input side of the network and implements the $\mathrm{cube}_0$ interconnection function.
Thus, there are two stages in the ESC which can perform $\mathrm{cube}_0$. The
incremental increase in hardware complexity for the ESC relative to the
Generalized Cube is O(N); overall ESC hardware complexity remains
O(N log N). Thus, it has relatively low incremental cost over the Generalized
Cube  network. 

For  the  ESC,  as  for  the  Generalized  Cube,  a  stage  is  considered  to  consist 
of  a  bank  of  interchange  boxes  and  the  links  from  their  outputs.  Thus,  the 
links  connecting  stage  i  boxes  to  stage  i-1  boxes  are  stage  i  links.  This 
definition  is  used  for  the  consistent  treatment  it  allows  of  both  boxes  and  links 
with  respect  to  the  fault  tolerance  discussion  of  Chapter  4.  The  supporting 
reason  for  this  decision  is  given  in  Chapter  4,  Section  4.3.3.  Note  that  there 
are  no  stage  0  links  under  this  definition  (recall  that  network  ports  are  not 
considered links).

Stage n and stage 0 can each be enabled or disabled (bypassed). A stage is
enabled when its interchange boxes are being used to provide interconnection.
It  is  disabled  when  its  interchange  boxes  are  being  bypassed.  Enabling  and 
disabling  in  stages  n  and  0  is  accomplished  with  a  demultiplexer  at  each  box 
input  and  a  multiplexer  at  each  output.  Figure  3.4  details  an  interchange  box 
from  stage  n  or  0.  One  demultiplexer  output  goes  to  a  box  input,  the  other  to 
an  input  of  its  corresponding  multiplexer.  The  remaining  multiplexer  input  is 
from  the  matching  box  output.  The  demultiplexer  and  multiplexer  are 
configured  such  that  they  either  both  connect  to  their  box  (enable)  or  both 
shunt  it  (disable).  All  demultiplexers  and  multiplexers  for  stage  n  share  a 
common  control  signal;  those  in  stage  0  also  share  a  common  control  signal. 

The  ESC  uses  stage  n  multiplexers  to  prevent  stage  n  box  output  failures 
from  affecting  stage  n  links.  The  multiplexers  provide  the  links  with  isolation 
from box output stuck-logic-level faults. The stage 0 input demultiplexers
serve  the  same  function  for  stage  1  links.  Preventing  these  box  faults  from 
affecting  associated  links  is  important  to  the  fault  tolerance  capabilities  of  the 
ESC,  which  are  discussed  in  Chapter  4. 

Stage  n  and  0  enabling  and  disabling  are  performed  by  a  system  control 
unit.  Normally,  the  network  will  be  set  so  that  stage  n  is  disabled  and  stage  0 
is  enabled.  The  resulting  structure  is  that  of  the  Generalized  Cube.  Figure  3.5 
shows  stage  n  disabled,  stage  0  enabled,  and  the  path  from  input  2  to  output  1 
under this circumstance for an ESC with N=8. If after running fault
detection and location tests a fault is found, the network is reconfigured. A
fault in a stage n box requires no change in network configuration; stage n
remains disabled. If the fault is in stage 0 then stage n is enabled and stage 0
is disabled. Figure 3.6 shows an ESC with N = 8 and stage n enabled, stage 0
disabled, and the path from input 2 to output 1 in this case. For a fault in a
link or in a box in stages n-1 to 1, both stages n and 0 will be enabled.
Enabling both stages n and 0 provides tolerance to this type of fault by
providing two paths between any source and destination, only one of which can
contain the existing fault. This is discussed in detail in Chapter 4.

Figure 3.4 (a) Detail of interchange box with multiplexer and demultiplexer
for enabling and disabling. (b) Interchange box enabled. (c)
Interchange box disabled.

Figure 3.5 The path from input 2 to output 1 in the ESC for N = 8, when
stage n is disabled and stage 0 is enabled. Stage n and 0
multiplexer and demultiplexer settings are shown explicitly.
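
The reconfiguration rules described in the preceding paragraph amount to a small decision table, which might be sketched as follows. The function name `esc_stage_settings` and the encoding of a fault as a ("box" or "link", stage) pair are assumptions made for this illustration; the actual system control unit is not specified at this level of detail.

```python
from typing import Optional, Tuple


def esc_stage_settings(n: int,
                       fault: Optional[Tuple[str, int]] = None) -> Tuple[bool, bool]:
    """Return (stage_n_enabled, stage_0_enabled) for an ESC with n + 1 stages.

    fault is None for a fault-free network, or a (kind, stage) pair where
    kind is "box" (stages 0..n) or "link" (stages 1..n; there are no stage 0 links).
    """
    if fault is None:
        return False, True            # normal operation: Generalized Cube structure
    kind, stage = fault
    if kind == "link":
        if not 1 <= stage <= n:
            raise ValueError("link faults occur in stages 1 through n")
        return True, True             # two paths available; one avoids the link
    if kind == "box":
        if not 0 <= stage <= n:
            raise ValueError("box faults occur in stages 0 through n")
        if stage == n:
            return False, True        # faulty extra stage stays bypassed
        if stage == 0:
            return True, False        # extra stage takes over the cube_0 function
        return True, True             # box in stages n-1 to 1: use both end stages
    raise ValueError("fault kind must be 'box' or 'link'")
```

For N = 8 (n = 3), a stage 0 box fault yields `(True, False)`, the configuration the text associates with Figure 3.6, while the fault-free setting `(False, True)` is the one the text associates with Figure 3.5.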

Intuitively, for both the Generalized Cube and the ESC, stage i,
0 < i < n, determines the $i$th bit of the address of the output port to which
the data is sent. Consider the route from source $S = s_{n-1} \ldots s_1 s_0$ to destination
$D = d_{n-1} \ldots d_1 d_0$. If the route passes through stage i using the straight
connection then the $i$th bit of the source and destination addresses will be the
same, i.e., $d_i = s_i$. If the exchange setting is used, the $i$th bits will be
complementary, i.e., $d_i = \bar{s}_i$. In the Generalized Cube, stage 0 determines the
0th bit position of the destination in a similar fashion. In the ESC, however,
both stage n and stage 0 can affect the 0th bit of the output address. Using the
straight connection in stage n performs routings as they occur in the
Generalized Cube. The exchange setting makes available an alternate route
not present in the Generalized Cube. In particular, the route enters stage n-1
at label $s_{n-1} \ldots s_1 \bar{s}_0$, instead of $s_{n-1} \ldots s_1 s_0$.
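
The two routes between a source and a destination can be traced with a short Python sketch. It assumes, following the description above, that stages are traversed in the order n, n-1, ..., 0 and that each stage i < n leaves the label with bit i equal to the destination bit. The function name `esc_route` and the (stage, label) output format are illustrative choices, not part of the ESC definition.

```python
from typing import List, Tuple


def esc_route(source: int, dest: int, n: int,
              exchange_in_stage_n: bool) -> List[Tuple[int, int]]:
    """Trace one ESC route as a list of (stage, label leaving that stage).

    Stage n either passes the label unchanged (straight or bypassed) or
    complements bit 0 (exchange). Each stage i = n-1, ..., 0 then sets
    bit i of the label to bit i of the destination address.
    """
    label = source ^ 1 if exchange_in_stage_n else source
    path = [(n, label)]
    for i in range(n - 1, -1, -1):
        mask = 1 << i
        label = (label & ~mask) | (dest & mask)
        path.append((i, label))
    return path
```

For N = 8, routing from input 2 to output 1 gives the labels 2, 2, 0, 1 with stage n straight and 3, 3, 1, 1 with stage n set to exchange. The labels leaving stages n through 1 differ in bit 0 on the two routes, so under this model the routes share no link and no box in stages n-1 through 1; both end at the destination label after stage 0.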

For  convenience  in  later  discussion,  the  term  network  component  is  defined 
to  be  any  element  comprising  the  structure  of  the  ESC.  Thus,  the  term  could 
refer  to  an  interchange  box,  a  link,  a  multiplexer,  or  a  demultiplexer. 


3.5 Conclusions


This  chapter  has  presented  the  Extra  Stage  Cube  interconnection  network. 
The  Generalized  Cube  network  was  defined  and  the  derivation  of  the  ESC  from 
it was shown. The ESC is only slightly more physically complex than its
parent  network.  Properties  of  the  Generalized  Cube,  which  are  also  properties 
of  the  ESC,  were  listed.  Its  extra  stage  of  switching  elements  and  its  bypass 
circuitry  allow  the  ESC  to  tolerate  faults,  a  property  not  shared  with  the 
Generalized Cube network. Operation of ESC bypass circuitry was described.

CHAPTER 4

PROPERTIES  OF  THE  EXTRA  STAGE  CUBE 
INTERCONNECTION  NETWORK 


4.1  Introduction 

The  discussion  in  Chapter  2  of  parallel  systems  has  demonstrated  that  an 
important  component  of  a  parallel  computer  is  a  mechanism  for  information 
transfer  between  the  various  system  components.  Because  of  system 
complexity,  assuring  high  reliability  is  a  significant  task.  Thus,  a  crucial  aspect 
of  an  interconnection  network  used  to  meet  system  communication  needs  is 
fault  tolerance. 

The  formal  description  of  the  fault  tolerance  of  an  interconnection 
network  requires  the  definition  of  both  a  fault  model  and  fault  tolerance 
criterion.  With  these  in  hand  the  fault  tolerance  of  the  ESC  will  be  studied. 
After  considering  the  effects  of  a  single  fault,  the  response  of  the  ESC  to 
multiple  faults  is  considered  as  well,  because  of  the  desirability  of  graceful 
degradation  capability.  Other  important  aspects  of  any  interconnection 
network  include:  method  of  control,  routing  capability,  partitioning  capability, 
and  permuting  capability.  Each  of  these  issues  is  explored  in  this  chapter.  All 
will  be  related  to  the  fault  tolerance  of  the  ESC,  demonstrating  the  feasibility 
of  achieving  fault  tolerant  communication  for  a  parallel  computer  with  this 
network.

4.2 Fault-Tolerance Model


Networks  that  can  continue,  in  at  least  some  cases,  to  provide 
interconnection  service  even  when  they  contain  faulty  components  are  said  to 
be  fault-tolerant.  A  network  is  termed  single-fault  tolerant  if  it  can  function  in 
spite of a single fault. If up to i faults can be tolerated then the network is
i-fault tolerant. A network is robust if it can tolerate some instances of i faults,
but  is  not  i-fault  tolerant.  A  fault  is  hard  if  it  is  not  of  a  transient  nature;  all 
faults  will  be  assumed  hard  unless  otherwise  stated. 

It  is  only  meaningful  to  speak  of  a  network  as  i-fault  tolerant  with  regard 
to  a  particular  fault-tolerance  model.  A  fault-tolerance  model  consists  of  two 
components.  The  first  is  the  fault  model,  which  characterizes  all  faults  that  are 
assumed  to  occur  in  the  network.  The  fault  model  for  a  given  network  may  or 
may  not  correspond  closely  to  actual  or  predicted  experience  with  hardware. 
In  particular,  fault  models  are  often  chosen  to  have  characteristics  suited  for 
performing  an  analysis,  even  if  those  characteristics  may  depart  widely  from 
reality.  The  second  component  is  the  fault-tolerance  criterion,  the  condition 
that  must  be  met  for  the  network  to  be  said  to  have  tolerated  a  given  fault  or 
faults.  This  criterion  varies  due  to  differences  in  the  definition  of  what 
constitutes  functionality  for  a  given  network  (basically,  what  amount  of 
degradation  from  the  fault-free  condition  is  allowed). 

The  fault  model  chosen  for  the  ESC  is  the  following. 


1. Any interchange box or link can fail.

2. Faulty boxes and links are considered unusable until such time as the fault
is  remedied. 

3.  Faults  occur  independently. 

4.  All  interchange  boxes  have  identical  reliability,  which  may  differ  from  the 
identical  reliability  characterizing  all  links. 

5.  Input  stage  multiplexers  and  output  stage  demultiplexers  have  no  effect  on 
the  reliability  of  their  associated  links. 

Item  1  of  the  fault  model  implies  that  network  ports,  input  demultiplexers,  and 
output  multiplexers  are  always  fault-free.  Devices  cannot  access  the  network  if 
these  components  fail.  Chapter  7  shows  that  physical  implementation  of  the 
ESC  need  not  involve  input  demultiplexers  nor  output  multiplexers. 

The ESC fault-tolerance criterion is retention of full access [CiS82]. Full
access  capability  is  the  ability  to  connect  any  given  input  to  any  output,  a 
property  of  the  fault-free  network.  This  implies  that  the  ESC  with  a  single 
fault  retains  its  fault-free  interconnection  capability. 

Once  a  fault  has  been  detected  and  located  in  the  ESC,  the  failing  portion 
of  the  network  is  considered  unusable  until  such  time  as  the  fault  is  remedied. 
Specifically,  if  an  interchange  box  is  faulty,  data  will  not  be  routed  through  it, 
nor  will  data  be  passed  over  a  faulty  link.  The  extra  stage  of  the  ESC  does 
increase  the  likelihood  of  a  fault,  compared  to  the  Generalized  Cube,  due  to  the 
additional  hardware.  It  should  also  be  noted  that  a  failure  in  a  stage  n 
multiplexer  or  stage  0  demultiplexer  has  the  effect  of  a  link  fault,  which  the 
ESC  can  tolerate,  as  shown  in  Section  4.3. 

Techniques such as test patterns [FeK82, FeW81, Lim82, LeS83], dynamic
parity checking [SiM81b, LLY82], or write/read-back/verify [ThN83] for fault
detection and location have been described for use in the Generalized Cube
topology.  Test  patterns  are  used  to  determine  network  integrity  globally  by 
checking  the  data  arriving  at  the  network  outputs  as  a  result  of  N  strings  (one 
per  input  port)  of  test  inputs.  With  dynamic  parity  checking,  each  interchange 
box  monitors  the  status  of  boxes  and  links  connected  to  its  inputs  by 
examining  incoming  data  for  correct  parity.  It  is  assumed  that  the  ESC  can  be 
tested  to  determine  the  existence  and  location  of  faults.  This  chapter  is  not 
concerned  with  the  procedures  to  accomplish  this,  but  rather  with  how  to 
recover  once  a  fault  is  located.  Recovery  from  such  a  fault  is  something  of 
which  the  Generalized  Cube  and  its  topologically  equivalent  networks  are 
incapable.  The  Generalized  Cube  is  neither  single-fault  tolerant  nor  robust 
with  respect  to  the  ESC  fault  model  and  fault- tolerance  criterion. 

4.3  Single-Fault  Tolerance 

The  ESC  achieves  its  fault-tolerant  capabilities  by  having  redundant  paths 
and  broadcast  paths.  This  section  formally  establishes  the  existence  of  these 
redundant paths. Then procedures for finding a fault-free path or broadcast
path  in  a  faulted,  but  functional,  ESC  are  set  forth. 

4.3.1  One-to-One  Connections 

The  first  step  towards  establishing  the  fault  tolerance  of  the  ESC  is  to 
show  that  it  can  provide  more  than  one  path  between  any  source  and 
destination. 


Theorem 4.1: In the ESC with both stages n and 0 enabled there exist exactly
two paths between any source and any destination.

Proof: There is exactly one path from a source S to a destination D in the
Generalized Cube [Law75]. Stage n of the ESC allows access to two distinct
stage n-1 inputs, S and $\mathrm{cube}_0(S)$. Stages n-1 to 0 of the ESC form a
Generalized Cube topology, so the two stage n-1 inputs each have a single
path to the destination and these paths are distinct (since they differ at stage
n-1 at least).

□ 

Figure  4.1  shows  the  two  paths  available  between  input  1  and  output  4  in 
an  ESC  with  N  =  8  when  both  stages  n  and  0  are  enabled.  The  settings  of  the 
stage  n  and  0  multiplexers  and  demultiplexers  are  shown  to  explicitly  depict 
stages  n  and  0  enabled.  One  path  from  1  to  4  uses  the  straight  setting  in  stage 
n;  the  other  uses  the  exchange  setting. 

The  existence  of  at  least  two  paths  between  any  source/destination  pair  is 
a  necessary  condition  for  fault  tolerance.  Redundant  paths  allow  continued 
communication  between  source  and  destination  if  at  least  one  path  remains 
functional  after  a  fault.  It  can  be  shown  that  for  the  ESC  two  paths  are 
sufficient  to  provide  tolerance  to  single  faults  for  one-to-one  connections. 

Lemma 4.1: The two paths between a given source and destination in the ESC
with stages n and 0 enabled have no links in common.

Proof: A source S can connect to the stage n-1 inputs S or $\mathrm{cube}_0(S)$. These
two inputs differ in the 0th, or low-order, bit position. Other than stage n, only
stage 0 can cause a source to be mapped to a destination which differs from the
source in the low-order bit position. Therefore, the path from S through stage
n-1 input S to the destination D contains only links with labels which agree
with S in the low-order bit position. Similarly, the path through stage n-1
input $\mathrm{cube}_0(S)$ contains only links with labels agreeing with $\mathrm{cube}_0(S)$ in the
low-order bit position. Thus, no link is part of both paths.

□ 

Lemma  4.2:  The  two  paths  between  a  given  source  and  destination  in  the  ESC 
with  stages  n  and  0  enabled  have  no  interchange  boxes  from  stage  n-1 
through  1  in  common. 

Proof:  Since  the  two  paths  have  the  same  source  and  destination,  they  will 
pass through the same stage n and 0 interchange boxes. No box in stages n-1
through  1  has  input  link  labels  that  differ  in  the  low-order  bit  position.  One 
path  from  S  to  D  contains  only  links  with  labels  agreeing  with  S  in  the  low- 
order  bit  position.  The  other  path  has  only  links  with  labels  that  are  the 
complement  of  S  in  the  low-order  bit  position.  Therefore,  no  box  in  stages 
n-1  through  1  belongs  to  both  paths. 

□ 
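The disjointness stated in Lemmas 4.1 and 4.2 can be checked exhaustively for a small network. The Python sketch below is illustrative only. Its function names are hypothetical, and it assumes a simple model consistent with the text: stage n either passes the source label straight through or applies cube_0, each stage i from n-1 down to 1 sets bit i of the current label to bit i of the destination, and a stage i box is identified by one of its output labels with bit i cleared (the two outputs of a stage i box differ only in bit i).

```python
def esc_paths(s, d, n):
    """Return (primary, secondary) link lists for source s and destination d.

    Each list holds (stage, output_label) for stages n down to 1, i.e. the
    inter-stage links used when both stage n and stage 0 are enabled.
    """
    paths = []
    for exchange in (False, True):
        addr = s ^ 1 if exchange else s          # stage n: straight or cube_0
        links = [(n, addr)]
        for i in range(n - 1, 0, -1):            # stage i fixes bit i to match d
            addr = (addr & ~(1 << i)) | (d & (1 << i))
            links.append((i, addr))
        paths.append(links)
    return paths[0], paths[1]


def boxes_used(links, n):
    """Boxes of stages n-1..1 on a path, named by output label with bit i cleared."""
    return {(i, label & ~(1 << i)) for i, label in links if 1 <= i <= n - 1}


def paths_are_disjoint(n):
    """True if, for every source/destination pair, the two paths share
    no inter-stage link and no box in stages n-1 through 1."""
    size = 1 << n
    for s in range(size):
        for d in range(size):
            p, q = esc_paths(s, d, n)
            if set(p) & set(q) or boxes_used(p, n) & boxes_used(q, n):
                return False
    return True


primary, secondary = esc_paths(1, 4, 3)   # input 1 to output 4, N = 8
print(primary)
print(secondary)
print(paths_are_disjoint(3))
```

The check succeeds for the reason given in the proofs: every label on one path agrees with S in the low-order bit and every label on the other path has the complement, and clearing bit i (for i at least 1) never touches that bit.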

Theorem 4.2: In the ESC with a single fault there exists at least one fault-free
path  between  any  source  and  destination. 



Proof:  Assume  first  that  a  link  is  faulty.  If  both  stages  n  and  0  are  enabled, 
Lemma  4.1  implies  that  at  most  one  of  the  paths  between  a  source  and 
destination can be faulty. Hence, a fault-free path exists.

Now  assume  that  an  interchange  box  is  faulty.  There  are  two  cases  to 
consider.  If  the  faulty  box  is  in  stage  n  or  0,  the  affected  stage  can  be  disabled. 
The  remaining  n  stages  are  sufficient  to  provide  one  path  between  any  source 
and  destination  (i.e.,  all  n  cube  functions  are  still  available).  If  the  faulty  box 
is  not  in  stage  n  or  0,  Lemma  4.2  implies  that  if  both  stages  n  and  0  are 
enabled  then  at  most  one  of  the  paths  is  faulty.  So,  again,  a  fault-free  path 
exists. 

Two  paths  exist  when  the  fault  is  in  neither  of  the  two  paths  between 
source  and  destination. 

□

Figure  4.2  shows  one  example  of  how  the  ESC  tolerates  a  single  fault.  A 
stage  2  link  is  indicated  as  faulty  by  the  dashed  line.  The  two  possible  paths 
from  stage  n  when  it  is  enabled  are  shown,  since  it  will  be  enabled  due  to  the 
existence  of  a  fault  in  a  network  component  other  than  a  stage  n  box.  Despite 
the  fact  that  stage  0  is  bypassed  as  a  result  of  the  fault,  one  of  these  paths 
remains  fault-free. 

4.3.2  Broadcast  Connections 

The two paths between any source and destination of the ESC provide
fault tolerance for performing broadcasts as well.

Theorem 4.3: In the ESC with both stages n and 0 enabled there exist exactly
two broadcast paths for any broadcast.

Proof:  A  one-to-many  broadcast  path  is  just  a  collection  of  one-to-one 
connections  with  the  same  source.  There  is  exactly  one  broadcast  path  from  a 
source to its destinations in the Generalized Cube [SiS78, SiM81b]. Stage n of
the ESC allows a source S access to two distinct stage n-1 inputs, S and
$\mathrm{cube}_0(S)$. Any set of destinations to which S can broadcast, $\mathrm{cube}_0(S)$ can also
broadcast, because any broadcast can be performed.

□ 

Figure  4.3  illustrates  the  two  broadcast  paths  available  to  connect  input  0 
to  outputs  2,  3,  6,  and  7  in  an  ESC  with  N  =  8  when  both  stage  n  and  0  are 
enabled.  The  bold  line  indicates  one  complete  broadcast  path,  the  dashed  line 
denotes  the  other. 

Lemma 4.3: The two broadcast paths between a given source and its
destinations  in  the  ESC  with  stages  n  and  0  enabled  have  no  links  in  common. 

Proof: All links in the broadcast path from the stage n-1 input S have labels
which agree with S in the low-order bit position. All links in the broadcast
path from the stage n-1 input $\mathrm{cube}_0(S)$ have labels which are the complement
of S in the low-order bit position. Thus, no link is part of both broadcast paths.




Lemma 4.4: The two broadcast paths between a given source and its
destinations  in  the  ESC  with  stages  n  and  0  enabled  have  no  interchange  boxes 
from  stage  n-1  through  1  in  common. 

Proof:  Since  the  two  broadcast  paths  have  the  same  source  and  destinations, 
they  will  pass  through  the  same  stage  n  and  0  interchange  boxes.  No  box  in 
stages  n-1  through  1  has  input  link  labels  which  differ  in  the  low-order  bit 
position. From the proof of Lemma 4.3, the link labels of the two broadcast
paths  differ  in  the  low-order  bit  position.  Therefore,  no  box  in  stages  n-1 
through  1  belongs  to  both  broadcast  paths. 

□ 

Lemma 4.5: With stage 0 disabled and stage n enabled, the ESC can form any
broadcast path which can be formed by the Generalized Cube.

Proof: Stages n through 1 of the ESC provide a complete set of n cube
interconnection functions in the order $\mathrm{cube}_0, \mathrm{cube}_{n-1}, \ldots, \mathrm{cube}_1$. A path exists
between any source and destination with stage 0 disabled because all n cube
functions  are  available.  This  is  regardless  of  the  order  of  the  interconnection 
functions.  So,  a  set  of  paths  connecting  an  arbitrary  source  to  any  set  of 
destinations  exists.  Therefore,  any  broadcast  path  can  be  formed. 

□ 

Theorem 4.4: In the ESC with a single fault there exists at least one fault-free
broadcast  path  for  any  broadcast  performable  by  the  Generalized  Cube. 


Proof:  Assume  the  fault  is  in  stage  0,  i.e.,  disable  stage  0,  enable  stage  n. 
Lemma  4.5  implies  a  fault-free  broadcast  path  exists.  Assume  the  fault  is  in  a 
link  or  a  box  in  stages  n-1  to  1.  From  Lemmas  4.3  and  4.4,  the  two 
broadcast  paths  will  have  none  of  these  network  elements  in  common. 
Therefore,  at  least  one  broadcast  path  will  be  fault-free,  possibly  both.  Finally, 
assume  the  fault  is  in  stage  n.  Stage  n  will  be  disabled  and  the  broadcast 
capability  of  the  ESC  will  be  the  same  as  that  of  the  Generalized  Cube. 

□ 

4.3.3  Finding  Fault-Free  Paths 

The  ESC  path  that  routes  S  to  D  and  corresponds  to  the  Generalized 
Cube  path  from  S  to  D  is  called  the  primary  path.  This  path  must  either 
bypass  stage  n  or  use  the  straight  setting  in  stage  n.  The  other  path  available 
to  connect  S  to  D  is  the  secondary  path.  It  must  use  the  exchange  setting  in 
stage  n.  The  concept  of  primary  path  can  be  extended  for  broadcasting.  The 
broadcast  path,  or  set  of  paths,  in  the  ESC  analogous  to  that  available  in  the 
Generalized  Cube  is  called  the  primary  broadcast  path.  This  is  because  each 
path  from  the  source  to  one  of  the  destinations  is  a  primary  path.  If  every 
primary  path  is  replaced  by  its  secondary  path  the  result  is  the  secondary 
broadcast  path. 

Given S and D, the network links and boxes used by a path can be found.
As discussed in [Law75], for the source/destination pair S with binary
representation $s_{n-1} \ldots s_1 s_0$ and D with binary representation $d_{n-1} \ldots d_1 d_0$ the path
followed in the Generalized Cube topology uses the stage i output labeled
$d_{n-1} \ldots d_{i+1} d_i s_{i-1} \ldots s_1 s_0$. The following theorem extends this for the ESC. Note
that given $x_{n-1} \ldots x_1 x_0$ and $y_{n-1} \ldots y_1 y_0$, the notational convention that

$$x_{n-1} \ldots x_{i+1} x_i y_{i-1} \ldots y_1 y_0 = y_{n-1} \ldots y_1 y_0, \quad \text{if } i = n,$$

is used. Also,

$$x_{n-1} \ldots x_{i+1} x_i y_{i-1} \ldots y_1 y_0 = x_{n-1} \ldots x_1 x_0, \quad \text{if } i = 0.$$

Theorem 4.5: For the source/destination pair $S = s_{n-1} \ldots s_1 s_0$ and
$D = d_{n-1} \ldots d_1 d_0$, the primary path uses the stage i output labeled
$d_{n-1} \ldots d_{i+1} d_i s_{i-1} \ldots s_1 s_0$ and the secondary path uses $d_{n-1} \ldots d_{i+1} d_i s_{i-1} \ldots s_1 \bar{s}_0$, for
$0 \le i \le n$.

Proof: Stage i, $i \ne 0$, is the only stage in the ESC that can map the ith bit of a
source address (i.e., determine the ith bit of the destination). Thus, if S is to
reach D both ESC paths must use a stage i output with a label that matches D
in the ith bit position. This matching occurs at each stage, so the high-order
n-i bits of the output label will be $d_{n-1} \ldots d_{i+1} d_i$. At the output of stage i, bit
positions i-1 to 1 have yet to be affected so they match source address bits
i-1 to 1. The low-order bit position is unchanged by stage n for the primary
path. The secondary path includes the $\mathrm{cube}_0$ connection (exchange) in stage n,
therefore the low-order bit position is complemented.

□ 
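The labels given by Theorem 4.5 are simple bit splices, which the following Python sketch makes concrete. It is an illustration under stated assumptions rather than part of the original work: the function name `stage_output` is hypothetical, addresses are held as integers, and the end cases i = n (all bits from the source) and i = 0 (all bits from the destination) follow the notational convention given before the theorem.

```python
def stage_output(s, d, i, n, secondary=False):
    """Label of the stage-i output used by the path from s to d.

    Bits n-1..i are taken from d and bits i-1..0 from s; on the secondary
    path the low-order source bit is complemented.
    """
    if not 0 <= i <= n:
        raise ValueError("stage index must lie between 0 and n")
    if i == 0:
        return d                       # every path leaves stage 0 at output d
    low_mask = (1 << i) - 1            # selects bit positions i-1..0
    low = s & low_mask
    if secondary:
        low ^= 1                       # exchange in stage n flips bit 0
    return (d & ~low_mask) | low


n = 3                                   # N = 8
for name, flag in (("primary", False), ("secondary", True)):
    labels = [stage_output(1, 4, i, n, secondary=flag) for i in range(n, -1, -1)]
    print(name, labels)                 # outputs for stages n, n-1, ..., 0
```

For input 1 and output 4 in an ESC with N = 8, the two label sequences differ in every stage except stage 0, where both paths arrive at the destination.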


When  a  fault  has  been  detected  and  located,  each  source  will  receive  a 
fault  label,  or  labels,  uniquely  specifying  the  fault  location.  This  is 
accomplished  by  giving  a  stage  number  and  a  stage  output  number.  For 
example, if the link between stages i and i-1 from the stage i output j fails,
each source receives the fault label (i,j). If a box in stage i with outputs j and
k fails, the pair of fault labels (i,j) and (i,k) is sent to each source. Since j and
k differ only in the ith bit position a single fault label $(i, h_{n-1} \ldots h_{i+1}\, x\, h_{i-1} \ldots h_1 h_0)$
can be sent, where $h_m = j_m = k_m$ for $0 \le m \le n-1$ and $m \ne i$. The “x”
represents  both  a  1  and  a  0.  For  a  fault  in  stage  0  no  fault  label  will  be  given, 
only  notice  that  a  stage  0  fault  exists.  This  is  because  stage  0  will  be  disabled 
in  the  event  of  such  a  fault,  so  no  path  could  include  a  stage  0  fault.  Note  that 
the  only  possible  faults  in  stage  0  are  box  faults.  Stage  n  box  faults  require 
system  maintenance,  but  no  fault  labels  need  be  issued,  as  the  stage  will  be 
disabled.  A  label  is  issued  in  the  event  of  a  stage  n  link  fault.  Note  also  that 
the  number  of  faults  may  not  equal  the  number  of  fault  labels. 
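As an illustration of the labeling scheme just described, the sketch below builds fault labels for link faults and for box faults in stages n-1 through 1. The function names and the string form of the wildcard label are assumptions made for this example; the text specifies only the content of the labels.

```python
def link_fault_labels(stage, output):
    """A faulty link leaving stage `stage` at output `output` yields one label."""
    return [(stage, output)]


def box_fault_labels(stage, output, n):
    """A faulty box in stages n-1..1 yields two labels, one per box output.

    `output` may be either output of the box; the other differs in bit `stage`.
    """
    if not 1 <= stage <= n - 1:
        raise ValueError("stage n and stage 0 box faults are handled by "
                         "disabling the stage, so no label is issued")
    partner = output ^ (1 << stage)
    return sorted([(stage, output), (stage, partner)])


def wildcard_label(stage, output, n):
    """Single-label form for a box fault: bit `stage` is replaced by 'x'."""
    bits = list(format(output, f"0{n}b"))      # most significant bit first
    bits[n - 1 - stage] = "x"
    return stage, "".join(bits)


print(link_fault_labels(2, 5))
print(box_fault_labels(1, 4, 3))
print(wildcard_label(1, 4, 3))
```

A box fault is one fault but two labels, which is why the number of faults need not equal the number of fault labels.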

If  a  network  fault  lies  outside  stage  n  or  0  boxes,  a  source  can  check  the 
primary  path  to  its  intended  destination  for  the  faulty  network  component. 
For any faulty link or a faulty box in stage i, $1 \le i < n$, the source forms
$d_{n-1} \ldots d_{i+1} d_i s_{i-1} \ldots s_1 s_0$ and compares this with the stage output numbers of the
fault label(s). (Note that if there is a faulty stage n link $s_{n-1} \ldots s_1 s_0$ is used.) If
there  is  a  match  then  the  primary  path  is  faulty;  if  not,  it  is  fault-free.  If  the 
primary  path  is  fault-free  it  can  be  used.  If  faulty,  the  secondary  path  will  be 
fault-free  and,  hence,  usable. 
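The source-side check described above can be sketched as follows. This is a minimal illustration that assumes a single fault (so that the secondary path is guaranteed fault-free whenever the primary is not), integer addresses, and fault labels given as (stage, output) pairs with concrete output numbers; the function names are hypothetical.

```python
def primary_stage_output(s, d, i):
    """Stage-i output label on the primary path: high bits from d, low i bits from s.

    For a stage n link (i = n) the mask covers every address bit, so the
    result is simply s, matching the note in the text.
    """
    low_mask = (1 << i) - 1
    return (d & ~low_mask) | (s & low_mask)


def primary_is_faulty(s, d, fault_labels):
    """True if any fault label lies on the primary path from s to d."""
    return any(primary_stage_output(s, d, i) == label for i, label in fault_labels)


def choose_path(s, d, fault_labels):
    """Use the primary path unless a fault label matches it."""
    return "secondary" if primary_is_faulty(s, d, fault_labels) else "primary"


link_fault = [(2, 2)]                 # one faulty link: stage 2, output 2
print(choose_path(2, 0, link_fault))  # primary path passes through the fault
print(choose_path(1, 4, link_fault))  # primary path avoids the fault

box_fault = [(1, 4), (1, 6)]          # one faulty stage 1 box, two labels
print(choose_path(0, 4, box_fault))
```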

Since  a  broadcast  path  to  many  destinations  is  the  union  of  many  paths 
from  the  source  to  the  destinations,  exhaustive  checking  to  see  if  one  of  these 
paths  contains  a  fault  may  involve  more  computational  effort  than  is  desirable. 
To  decide  if  the  secondary  broadcast  path  should  be  used,  a  simpler  criterion 
than  checking  each  path  for  the  fault  exists.  For  any  faulty  link  or  a  faulty 
box in stage i, $1 \le i < n$, the test is to compare the low-order i bits of the
source  address  and  the  stage  output  number(s)  of  the  faulty  link  or  box.  The 
primary  broadcast  paths  must  use  links  and  stage  n-1  to  1  boxes  with  stage 
output  numbers  that  agree  with  the  source  address  in  the  low-order  bit 
positions. Thus, if the low-order i bits of the fault label stage output
number(s) and the source address agree, the fault may lie in the primary
broadcast  path.  (The  fault  is  in  the  primary  broadcast  path  if  the  high-order 
n-i  bits  of  the  stage  output  number  match  the  corresponding  bits  of  any 
destination  of  the  broadcast.)  Using  the  secondary  broadcast  path  avoids  the 
possibility  of  encountering  the  fault.  This  method  is  computationally  simpler 
than  exhaustive  path  fault  checking,  but  can  result  in  unneeded  use  of  the 
secondary  broadcast  path. 

If there is no strong preference for using the primary versus secondary path
(or broadcast path), the test to check for any faulty link or stage i box,
$1 \le i < n$, can be reduced to just comparing on a single bit position. If the
low-order  source  address  and  fault  label  stage  output  number  bits  agree,  then 
the  primary  path  (or  primary  broadcast  path)  may  be  faulty,  so  the  secondary 
path  can  be  used.  This  simplified  procedure  will  result  in  unnecessary  use  of 
secondary  paths  (one-to-one),  and  more  unnecessary  use  of  secondary  broadcast 
paths  than  checking  the  low-order  i  bits. 
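The three broadcast tests discussed above (exact, low-order i bits, and single bit) trade precision for speed. The sketch below shows one way they might be written; the function names are hypothetical and fault labels are assumed to be (stage, output) pairs with integer output numbers.

```python
def low_bits_agree(s, label, bits):
    """Compare the low-order `bits` bits of the source and a stage output number."""
    mask = (1 << bits) - 1
    return (s & mask) == (label & mask)


def broadcast_may_hit(s, fault, single_bit=False):
    """Conservative test: the primary broadcast path *may* contain the fault.

    With single_bit=True only bit 0 is compared, which is cheaper but
    flags more broadcasts than comparing the low-order i bits.
    """
    i, label = fault
    return low_bits_agree(s, label, 1 if single_bit else i)


def broadcast_hits(s, dests, fault):
    """Exact test: the low-order i bits match the source and the high-order
    n-i bits match the corresponding bits of at least one destination."""
    i, label = fault
    return low_bits_agree(s, label, i) and any((d >> i) == (label >> i) for d in dests)


print(broadcast_may_hit(2, (2, 2)))                   # low two bits agree
print(broadcast_may_hit(2, (2, 0)))                   # low two bits differ
print(broadcast_may_hit(2, (2, 0), single_bit=True))  # bit 0 alone agrees
print(broadcast_hits(2, {5, 7}, (2, 2)))              # no destination matches
```

Any broadcast flagged by the exact test is also flagged by the i-bit test, and any broadcast flagged by the i-bit test is also flagged by the single-bit test, so the cheaper tests never miss a faulty primary broadcast path.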

4.4  Multiple-Fault  Tolerance 

Theorems  4.2  and  4.4  establish  the  capability  of  the  ESC  to  tolerate  a 
single  fault  given  its  fault-tolerance  model.  That  is,  any  one-to-one  or 
broadcast connection possible in the fault-free Generalized Cube network
remains  possible  in  the  ESC  despite  a  single  fault.  For  some  instances  of 
multiple  faults  the  ESC  also  retains  fault-free  interconnection  capability.  The 
necessary  and  sufficient  condition  for  this  is  that  the  primary  and  secondary 
paths  are  not  both  faulty. 

As  faults  are  detected  and  located  a  system  control  unit  can  determine 
whether  network  interconnection  capability  is  degraded. 


Theorem 4.6: Let $A = (i, a_{n-1} \ldots a_1 a_0)$ and $B = (j, b_{n-1} \ldots b_1 b_0)$, where
$1 \le j \le i \le n$, be two fault labels. If $a_{n-1} \ldots a_{i+1} a_i \ne b_{n-1} \ldots b_{i+1} b_i$, or if
$a_{j-1} \ldots a_1 \bar{a}_0 \ne b_{j-1} \ldots b_1 b_0$, then there will be at least one fault-free path between
any source and destination.

Proof: A fault-free path will exist for a source/destination pair S/D if taken
together the fault labels A and B do not indicate blockage of both the primary
and secondary paths. As shown in Theorem 4.5, the primary path uses stage i
output $d_{n-1} \ldots d_{i+1} d_i s_{i-1} \ldots s_1 s_0$ and the secondary path uses $d_{n-1} \ldots d_{i+1} d_i s_{i-1} \ldots s_1 \bar{s}_0$.
The stage j outputs used are $d_{n-1} \ldots d_{j+1} d_j s_{j-1} \ldots s_1 s_0$ and $d_{n-1} \ldots d_{j+1} d_j s_{j-1} \ldots s_1 \bar{s}_0$.
Without loss of generality it is assumed that $j \le i$. Thus, at stages i and j the
primary and secondary paths both use outputs with the same bits in positions
n-1 through i and j-1 through 1, and complementary values in position 0. If
$a_{n-1} \ldots a_{i+1} a_i \ne b_{n-1} \ldots b_{i+1} b_i$ then at least one of the faults is in neither the
primary nor the secondary path, so at least one of the paths is fault-free.
Similarly, if $a_{j-1} \ldots a_1 \bar{a}_0 \ne b_{j-1} \ldots b_1 b_0$ at least one fault is in neither path, so at
least one path is fault-free. For i = n the inequality
$a_{n-1} \ldots a_{i+1} a_i \ne b_{n-1} \ldots b_{i+1} b_i$ does not exist, and only the constraint
$a_{j-1} \ldots a_1 \bar{a}_0 \ne b_{j-1} \ldots b_1 b_0$ applies.

□ 
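The pairwise test of Theorem 4.6 reduces to two masked comparisons. The Python sketch below is one possible rendering, with a hypothetical function name. It assumes the reading of the criterion that is consistent with the proof (the two paths carry complementary values in bit position 0, so the low-order bit of one label is complemented before the comparison) and with the examples of Figures 4.4 and 4.5.

```python
def pair_retains_full_access(label_a, label_b):
    """Apply the Theorem 4.6 test to two fault labels (stage, output).

    Returns True when the pair cannot block both the primary and the
    secondary path of any source/destination pair.
    """
    # Order the labels so that (i, a) has the higher stage: j <= i.
    (i, a), (j, b) = sorted((label_a, label_b), key=lambda t: t[0], reverse=True)

    # Bits n-1..i. When i = n both shifts give 0, so this part of the
    # test is vacuous, as noted at the end of the proof.
    high_differs = (a >> i) != (b >> i)

    # Bits j-1..0, with the low-order bit of a complemented.
    low_mask = (1 << j) - 1
    low_differs = ((a ^ 1) & low_mask) != (b & low_mask)

    return high_differs or low_differs


print(pair_retains_full_access((2, 2), (1, 4)))   # faults of Figure 4.4
print(pair_retains_full_access((2, 5), (1, 4)))   # two of the Figure 4.5 labels
print(pair_retains_full_access((1, 4), (1, 6)))   # two labels of one faulty box
```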

When  multiple  faults  are  detected  and  located  in  the  ESC,  a  system 
control  unit  must  determine  the  appropriate  action.  Theorem  4.6  is  applied  if 
the  multiple  faults  occur  in  links  or  in  boxes  in  stages  n-1  to  1.  The  fault 
label(s)  of  any  new  fault(s)  is  compared  with  all  existing  fault  label(s).  If  each 
pair of fault labels meets the test of Theorem 4.6, then the network retains its
fault-free interconnection capability. (Note that the two fault labels associated
with a faulty box do satisfy the requirement of Theorem 4.6 since for stages
n-1 through 1 the low-order bits of the stage output numbers of such labels
agree, satisfying the Theorem 4.6 criterion $a_{j-1} \ldots a_1 \bar{a}_0 \ne b_{j-1} \ldots b_1 b_0$.)

With  multiple  stage  0  or  multiple  stage  n  faults  only,  the  affected  stage  is 
simply  disabled,  as  for  a  single  fault  in  either  stage;  fault-free  interconnection 
capability still exists. If faults exist in boxes of both stages n and 0, the stages
are both disabled, and the network cannot perform $\mathrm{cube}_0$. Thus, fault-free
interconnection capability is lost. If there are faults in stages n-1 through 1
and either stage n or 0 boxes, but not both, complete fault-free interconnection
capability no longer exists. This is because only one of stages n and 0 will be
available to be enabled, hence, there is only one path between any source and
destination. So, any faulty links or faulty boxes in stage i, $1 \le i < n$, will
block  the  only  path  between  certain  source/destination  pairs.  Finally,  if  a  pair 
of  fault  labels  fails  the  test  of  Theorem  4.6,  fault-free  interconnection 
capability  is  lost. 
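The decision procedure a system control unit might follow can be summarized in a few lines. The sketch below is a hedged illustration of the rules in the preceding paragraphs, not a description of any actual controller. The fault representation (a tuple of kind, stage, and output number) and all function names are assumptions introduced here, and the pairwise test uses the complemented low-order bit discussed in the proof of Theorem 4.6.

```python
from itertools import combinations


def pair_ok(label_a, label_b):
    """Theorem 4.6 test for two fault labels (stage, output)."""
    (i, a), (j, b) = sorted((label_a, label_b), key=lambda t: t[0], reverse=True)
    low_mask = (1 << j) - 1
    return (a >> i) != (b >> i) or ((a ^ 1) & low_mask) != (b & low_mask)


def full_access_retained(faults, n):
    """faults: list of (kind, stage, output) with kind 'box' or 'link'.

    For stage n and stage 0 boxes the output number is ignored, since the
    whole stage is disabled.
    """
    end_stages = {st for kind, st, _ in faults if kind == "box" and st in (0, n)}
    others = [f for f in faults if not (f[0] == "box" and f[1] in (0, n))]

    if len(end_stages) == 2:
        return False                 # stages n and 0 both disabled: no cube_0
    if end_stages:
        return not others            # one path per pair; any other fault blocks some pair

    labels = []
    for kind, stage, output in others:
        labels.append((stage, output))
        if kind == "box":            # a box in stages n-1..1 contributes two labels
            labels.append((stage, output ^ (1 << stage)))
    return all(pair_ok(x, y) for x, y in combinations(labels, 2))


print(full_access_retained([("link", 2, 2), ("link", 1, 4)], 3))
print(full_access_retained([("link", 2, 5), ("box", 1, 4)], 3))
print(full_access_retained([("box", 3, 0), ("box", 0, 0)], 3))
```

In the first call the two faults are treated as link faults purely for illustration; the text gives only their labels, (2,2) and (1,4).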

Figure  4.4  shows  an  ESC  network  with  N  =  8  and  two  faults  which  do  not 
cause  a  loss  of  fault- free  interconnection  capability.  The  dashed  lines  indicate 
the  two  faults,  which  have  fault  labels  (2,2)  and  (1,4).  The  bold  lines  indicate 
all  possible  paths  through  the  network  that  contain  the  fault  (2,2).  The  paths 
are  shown  using  only  the  connections  in  the  interchange  boxes  to  simplify  the 
figure.  The  bold  lines  should  be  taken  to  denote  the  two  alternative 
connections  through  each  box  that  could  be  part  of  a  path  containing  fault 
(2,2). The dotted lines are analogous to the bold lines, but for fault (1,4).
Note that no source has both possible stage n outputs leading to fault (2,2) or
(1,4), and no destination has both possible stage 0 inputs coming from the
faults. Therefore, no matter what destination is paired with any source there
will exist a fault-free connecting path.

If  fault-free  interconnection  capability  exists  then  full  operation  can 
continue. To continue, any additional fault labels are sent to each source.
However,  a  source  must  now  check  a  primary  path  against  a  longer  list  of  fault 
labels  to  determine  if  that  path  is  fault-free.  Therefore,  system  performance 
may  be  degraded  somewhat. 

For  an  SIMD  system  where  interconnection  network  routing  requirements 
are  limited  to  a  relatively  small  number  of  known  mappings,  multiple  faults 
that  preclude  fault- free  interconnection  capability  might  not  impact  system 
function.  This  would  occur  if  all  needed  permutations  could  be  performed 
(although  each  would  require  two  passes  (see  Section  4.7)).  Similar  faults  in 
MSIMD  or  MIMD  systems  may  leave  some  processes  unaffected.  For  these 
situations,  and  if  fail-soft  capability  is  important,  it  is  useful  to  determine 
which  source/destination  pairs  are  unable  to  communicate.  The  system  might 
then  attempt  to  reschedule  processes  such  that  their  needed  communication 
paths  will  be  available,  or  assess  the  impact  the  faults  will  have  on  its 
performance  and  report  to  the  user. 

The  exact  conditions  under  which  no  fault-free  path  exists,  for  some  one- 
to-one  connection,  can  be  determined  and  the  affected  source/destination  pairs 
characterized. 


Corollary 4.1: Let $A = (i, a_{n-1} \ldots a_1 a_0)$ and $B = (j, b_{n-1} \ldots b_1 b_0)$, where
$1 \le j \le i \le n$, be two fault labels. If $a_{n-1} \ldots a_{i+1} a_i = b_{n-1} \ldots b_{i+1} b_i$ and
$a_{j-1} \ldots a_1 \bar{a}_0 = b_{j-1} \ldots b_1 b_0$ then there exist source/destination pairs for which no
fault-free path exists. These pairs are such that $s_{i-1} \ldots s_2 s_1 = a_{i-1} \ldots a_2 a_1$,
$d_{n-1} \ldots d_{j+1} d_j = b_{n-1} \ldots b_{j+1} b_j$, and $s_{n-1} \ldots s_{i+1} s_i$, $s_0$, and $d_{j-1} \ldots d_1 d_0$ are arbitrary.

Proof: From Theorem 4.5, the two paths between a source and destination use
stage k outputs which differ only in the low-order bit, $1 \le k \le n$. Assume a
path contains the fault denoted by A. Then the stage i output used is such
that $d_{n-1} \ldots d_{i+1} d_i s_{i-1} \ldots s_2 s_1 x = a_{n-1} \ldots a_1 a_0$ and the stage j output used is
$d_{n-1} \ldots d_{j+1} d_j s_{j-1} \ldots s_2 s_1 x$ where x equals $s_0$ or $\bar{s}_0$ depending on whether the path is
primary or secondary. Now $a_{n-1} \ldots a_{i+1} a_i = b_{n-1} \ldots b_{i+1} b_i$ and
$a_{j-1} \ldots a_1 \bar{a}_0 = b_{j-1} \ldots b_1 b_0$. Thus, the alternate path uses stage j output
$d_{n-1} \ldots d_{j+1} d_j s_{j-1} \ldots s_2 s_1 \bar{x} = d_{n-1} \ldots d_{j+1} d_j b_{j-1} \ldots b_1 b_0$.
If $d_{n-1} \ldots d_{j+1} d_j = b_{n-1} \ldots b_{j+1} b_j$, then the
alternate path contains the fault denoted by B. Therefore, there exist
source/destination pairs for which no fault-free path exists.

The relationships $d_{n-1} \ldots d_{i+1} d_i s_{i-1} \ldots s_2 s_1 x = a_{n-1} \ldots a_1 a_0$ and
$d_{n-1} \ldots d_{j+1} d_j s_{j-1} \ldots s_2 s_1 \bar{x} = b_{n-1} \ldots b_1 b_0$ yield the constraints on the source and
destination addresses of $s_{i-1} \ldots s_2 s_1 = a_{i-1} \ldots a_2 a_1$ and $d_{n-1} \ldots d_{j+1} d_j = b_{n-1} \ldots b_{j+1} b_j$.
The values of $s_{n-1} \ldots s_{i+1} s_i$, $s_0$, and $d_{j-1} \ldots d_1 d_0$ are unconstrained.

□ 

Figure 4.5 shows an ESC with N = 8 for a case of two faults with fault
labels which fail the Theorem 4.6 criterion. The affected sources and
destinations are indicated by the circles. The affected sources (0, 1, 4, and 5)
have paths that can lead to one of the faults on each of their stage n outputs.
The affected destinations (4, 5, 6, and 7) have paths that can come from one of
the faults on each of their stage 0 inputs. An affected source cannot
communicate with affected destinations, but can with unaffected ones.
Similarly, affected destinations can receive information from unaffected sources,
but not from affected ones. Unaffected sources and destinations can communicate as
if the network were fault-free.

Figure 4.5 ESC network with N=8 and faults (2,5), (1,4), and (1,6)
(indicated by broken lines), which cause a loss of fault-free
interconnection capability. The affected network inputs and
outputs are circled.
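The affected pairs for the Figure 4.5 example can be reproduced by brute force. The following Python sketch is illustrative only: the function names are hypothetical, and it relies on the stage output labels of Theorem 4.5 (high-order bits from the destination, low-order bits from the source, with the low-order source bit complemented on the secondary path).

```python
def stage_output(s, d, i, secondary=False):
    """Stage-i output label (i >= 1) on the primary or secondary path."""
    low_mask = (1 << i) - 1
    low = (s ^ 1 if secondary else s) & low_mask
    return (d & ~low_mask) | low


def path_is_faulty(s, d, fault_labels, secondary):
    """True if the chosen path from s to d passes through a labeled fault."""
    return any(stage_output(s, d, i, secondary) == label for i, label in fault_labels)


def blocked_pairs(fault_labels, n):
    """All (source, destination) pairs whose two paths are both faulty."""
    size = 1 << n
    return {
        (s, d)
        for s in range(size)
        for d in range(size)
        if path_is_faulty(s, d, fault_labels, secondary=False)
        and path_is_faulty(s, d, fault_labels, secondary=True)
    }


blocked = blocked_pairs([(2, 5), (1, 4), (1, 6)], 3)
print(sorted({s for s, _ in blocked}))   # affected sources
print(sorted({d for _, d in blocked}))   # affected destinations
```

For these three fault labels the blocked pairs are exactly the combinations of an affected source with an affected destination, in agreement with Corollary 4.1.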

For  broadcast  paths,  continued  operation  under  multiple  faults  is 
somewhat  more  complicated  since  faults  can  exist  in  both  a  primary  and 
secondary  broadcast  path  without  compromising  fault-free  (one-to-one) 
interconnection  capability.  The  tests  to  determine  if  a  primary  path  contains  a 
fault,  which  were  described  in  Section  4.3.3,  can  be  applied  in  this  case.  To 
check for a possible fault in a secondary path, $\bar{s}_0$ is used in place of $s_0$. If both
paths  contain  faults,  a  combination  of  primary  and  secondary  paths  can  be 
used  to  perform  the  broadcast.  However,  this  procedure  may  be  too  time 
consuming  to  be  practical. 

The  exact  conditions  under  which  no  fault-free  broadcast  path  exists,  for 
some  broadcast,  can  be  determined  and  the  affected  broadcasts  characterized. 

Corollary 4.2: Let $A = (i, a_{n-1} \ldots a_1 a_0)$ and $B = (j, b_{n-1} \ldots b_1 b_0)$, where
$1 \le j \le i \le n$, be any two fault labels. If $a_{j-1} \ldots a_1 \bar{a}_0 = b_{j-1} \ldots b_1 b_0$, then there
exist broadcasts for which no fault-free broadcast path exists. These
broadcasts are such that $s_{i-1} \ldots s_2 s_1 = a_{i-1} \ldots a_2 a_1$, $d^k_{n-1} \ldots d^k_{i+1} d^k_i = a_{n-1} \ldots a_{i+1} a_i$,
and $d^l_{n-1} \ldots d^l_{j+1} d^l_j = b_{n-1} \ldots b_{j+1} b_j$, where $S = s_{n-1} \ldots s_1 s_0$ is the source and
$D^k = d^k_{n-1} \ldots d^k_1 d^k_0$ and $D^l = d^l_{n-1} \ldots d^l_1 d^l_0$ are two of the destinations (and are not
necessarily distinct).

Proof: In general, a broadcast path uses stage i outputs of the form
$d_{n-1} \ldots d_{i+1} d_i s_{i-1} \ldots s_2 s_1 x$, where x equals $s_0$ or $\bar{s}_0$ depending on whether the
broadcast path is primary or secondary, and $D = d_{n-1} \ldots d_1 d_0$ represents one of
the broadcast destinations. As a consequence of Theorem 4.5, the two
broadcast paths from a source to a set of destinations use stage m outputs
which differ only in the low-order bit, where $1 \le m \le n$. Thus, the alternate
broadcast path uses stage j outputs of the form $d_{n-1} \ldots d_{j+1} d_j s_{j-1} \ldots s_2 s_1 \bar{x}$. To
construct those broadcasts with faulty primary and secondary paths due to
faults A and B, consider the following. Let the source $S = s_{n-1} \ldots s_1 s_0$ be such
that $s_{i-1} \ldots s_2 s_1 = a_{i-1} \ldots a_2 a_1$, and, without loss of generality let $x = a_0$. Let
$D^k = d^k_{n-1} \ldots d^k_1 d^k_0$, one of the destinations, be such that $d^k_{n-1} \ldots d^k_{i+1} d^k_i
= a_{n-1} \ldots a_{i+1} a_i$. Let $D^l = d^l_{n-1} \ldots d^l_1 d^l_0$, another destination (not necessarily
distinct from $D^k$), be such that $d^l_{n-1} \ldots d^l_{j+1} d^l_j = b_{n-1} \ldots b_{j+1} b_j$. Given
$a_{j-1} \ldots a_1 \bar{a}_0 = b_{j-1} \ldots b_1 b_0$, then the equalities
$d^k_{n-1} \ldots d^k_{i+1} d^k_i s_{i-1} \ldots s_2 s_1 x = a_{n-1} \ldots a_1 a_0$ and
$d^l_{n-1} \ldots d^l_{j+1} d^l_j s_{j-1} \ldots s_2 s_1 \bar{x} = b_{n-1} \ldots b_1 b_0$ are true. Any broadcast
for which the equalities hold does not have a fault-free primary or secondary
broadcast path.

□ 


Figures  4.6,  4.7,  and  4.8  show  how  a  broadcast  may  have  to  be  performed 
given multiple faults, even though fault-free interconnection capability is
retained.  Figure  4.6  depicts  an  ESC  for  N  =  8  with  faults  (2,2)  and  (1,5), 
which  do  not  cause  loss  of  fault-free  interconnection  capability,  and  the 
primary  broadcast  path  for  the  broadcast  from  input  2  to  outputs  1,  3,  5,  and 
7.  Stages  n  and  0  are  enabled  due  to  the  faults.  The  primary  broadcast  path 
contains  the  fault  (2,2).  Figure  4.7  shows  the  secondary  broadcast  path  which 
contains  fault  (1,5).  A  broadcast  path  combining  fault-free  portions  of  the 
primary  and  secondary  broadcast  paths  is  indicated  in  Figure  4.8. 


Figure  4.8  ESC  network  with  N=8  and  faults  (2,2)  and  (1,5)  showing  a 
fault-free  broadcast  path  from  input  2  to  outputs  1,  3,  5,  and  7 
that combines portions of the primary and secondary broadcast paths.


Table 4.1 summarizes the information on the consequences of, or action
required  in  response  to,  multiple  faults  in  the  ESC.  A  set  of  five  fault 
categories  can  describe  any  instance  of  multiple  faults  in  an  ESC  network.  The 
categories  are:  all  faults  in  stage  n  boxes,  all  faults  in  stage  0  boxes,  at  least 
one  but  not  all  faults  in  stage  n  boxes,  at  least  one  but  not  all  faults  in  stage  0 
boxes,  and  all  faults  in  neither  stage  n  nor  stage  0  boxes.  All  but  two  of  these 
categories  are  mutually  exclusive;  the  category  “at  least  one  but  not  all  faults 
in  stage  n  boxes”  can  refer  to  a  multiple  fault  situation  that  fits  “at  least  one 
but  not  all  faults  in  stage  0  boxes.”  For  each  fault  category  the  table  either 
states  whether  fault-free  interconnection  capability  is  present  or  absent,  or  lists 
what  further  action  is  needed  to  make  that  determination. 

4.5 Routing Tags

The use of routing tags to control the Generalized Cube topology has been
discussed in [Law75, SiM81b]. A broadcast routing tag has also been developed
[SiM81b, Wen76]. The details of one routing tag scheme are summarized here
to provide a basis for describing routing tags for the ESC.

For one-to-one connections in the Generalized Cube an n-bit tag is
computed from the source address, S, and the destination address, D. The
routing tag is $T = S \oplus D$, where $\oplus$ means bitwise exclusive-or
[SiM81b]. Let $t_{n-1} \ldots t_1 t_0$ be the binary representation of T. To
determine its required setting, an interchange box at stage i need only
examine $t_i$. If $t_i = 0$, the straight state is used; if $t_i = 1$, an
exchange is performed. For example, given S = 001 and D = 100, then T = 101,
and the box settings are exchange, straight, and exchange. Figure 4.9
illustrates this route in a fault-free ESC (which effectively has the
topology of the Generalized Cube network).
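
The tag computation just described can be sketched in a few lines of Python. This is only an illustration: the function names `routing_tag` and `box_settings` are hypothetical, and addresses are assumed to be held as ordinary non-negative integers.

```python
def routing_tag(source, dest):
    """Generalized Cube one-to-one tag: T = S XOR D."""
    return source ^ dest


def box_settings(tag, n):
    """Box settings along the route, listed from stage n-1 down to stage 0.

    Bit i of the tag controls stage i: 0 means straight, 1 means exchange.
    """
    return ["exchange" if (tag >> i) & 1 else "straight"
            for i in range(n - 1, -1, -1)]


# Worked example from the text: S = 001, D = 100, N = 8 (n = 3).
tag = routing_tag(0b001, 0b100)
print(format(tag, "03b"), box_settings(tag, 3))
```

For the example above the tag is 101, so the boxes at stages 2, 1, and 0 are set to exchange, straight, and exchange.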


Table 4.1 Summary of information on the multiple fault tolerance of the
ESC network. “FFIC” stands for fault-free interconnection
capability.

The routing tag scheme can be extended to allow broadcasting from a
source to a power of two destinations with one constraint. That is, if there are
$2^j$ destinations, $0 \le j \le n$, then the Hamming distance (number of
differing bit positions) [Lin70] between any two destination addresses must be
less than or equal to j [SiM81b]. Thus, there is a fixed set of j bit positions
where any pair of destination addresses may disagree, and n-j positions where
all agree. For example, the set of addresses {010, 011, 110, 111} meets the
criterion.
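
A small Python sketch of the destination-set check follows. It tests the constraint in the form given by the last two sentences (a power-of-two number of distinct addresses that disagree only within a fixed set of j bit positions). The function name `meets_broadcast_constraint` is hypothetical.

```python
def meets_broadcast_constraint(dests):
    """True if dests holds 2**j distinct addresses that agree everywhere
    except in at most j bit positions."""
    dests = set(dests)
    count = len(dests)
    if count == 0 or count & (count - 1):
        return False                 # not a power of two
    j = count.bit_length() - 1       # count == 2**j
    first = next(iter(dests))
    varying = 0
    for d in dests:
        varying |= first ^ d         # collect every bit where some address differs
    return bin(varying).count("1") <= j


print(meets_broadcast_constraint([0b010, 0b011, 0b110, 0b111]))
```

The set {010, 011, 110, 111} from the text passes because its four addresses disagree only in bit positions 2 and 0.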

To demonstrate how a broadcast routing tag is constructed, let S be the
source address and $D^1, D^2, \ldots, D^{2^j}$ be the $2^j$ destination
addresses. The routing tags are $T_i = S \oplus D^i$, $1 \le i \le 2^j$. These
tags will differ from each other only in the same j bit positions in which S
may differ from $D^i$, $1 \le i \le 2^j$.

The broadcast routing tag must provide information for routing and
determining branching points. Such tags consist of two parts, each with n bits,
for a total of 2n bits. Let the routing information be
$R = r_{n-1} \ldots r_1 r_0$ and the broadcast information be
$B = b_{n-1} \ldots b_1 b_0$. The j bits where tags $T_i$ differ determine the
stages in which broadcast connections will be needed. The broadcast routing
tag {R,B} is constructed by setting $R = T_i$, for any i, and
$B = D^k \oplus D^l$, where $D^k$ and $D^l$ are any two destinations which
differ by j bits.

To interpret {R,B}, an interchange box in stage i must examine $r_i$ and
$b_i$. If $b_i = 0$, $r_i$ has the same effect as $t_i$, the i-th bit of the
one-to-one connection tag. If $b_i = 1$, $r_i$ is ignored and an upper or lower
broadcast is performed depending upon whether the route uses the upper or
lower box input, respectively. For example, if S = 101, $D^1$ = 010,
$D^2$ = 011, $D^3$ = 110, and $D^4$ = 111, then R = 111 and B = 101. The
network configuration for this broadcast is shown in Figure 4.10 for a
fault-free ESC.
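
The construction and interpretation of {R,B} can be sketched as follows. This is a minimal illustration under two assumptions: the destination set already meets the power-of-two constraint, and B is accumulated as the OR of $D^1 \oplus D^i$ over all destinations, which under that constraint equals $D^k \oplus D^l$ for a pair differing in all j positions. The names `broadcast_tag` and `reached_destinations` are hypothetical.

```python
def broadcast_tag(source, dests):
    """Return (R, B) for a destination list meeting the broadcast constraint."""
    r = source ^ dests[0]            # R = T_i for any i
    b = 0
    for d in dests:
        b |= dests[0] ^ d            # bit positions where destinations disagree
    return r, b


def reached_destinations(source, r, b, n):
    """Expand {R,B} stage by stage into the set of destinations it reaches."""
    dests = [0]
    for i in range(n):
        if (b >> i) & 1:             # b_i = 1: broadcast, both bit values occur
            dests = [d | (bit << i) for d in dests for bit in (0, 1)]
        else:                        # b_i = 0: r_i acts like the one-to-one tag bit
            bit = ((source ^ r) >> i) & 1
            dests = [d | (bit << i) for d in dests]
    return sorted(dests)


# Worked example from the text: S = 101, destinations 010, 011, 110, 111.
r, b = broadcast_tag(0b101, [0b010, 0b011, 0b110, 0b111])
print(format(r, "03b"), format(b, "03b"), reached_destinations(0b101, r, b, 3))
```

For this example the sketch yields R = 111 and B = 101, and expanding the tag recovers destinations 2, 3, 6, and 7.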
Both routing tags and broadcast routing tags for the ESC that take full
advantage of its fault tolerant capabilities can be derived from the tag schemes
for the Generalized Cube. The ESC uses n + 1 bit routing tags. The one-to-one
routing tag is $T' = t'_n \ldots t'_1 t'_0$ and the broadcast routing tag is
{R',B'}, where $R' = r'_n \ldots r'_1 r'_0$ and $B' = b'_n \ldots b'_1 b'_0$.
The additional bit position is to control stage n. Actual tag values depend on
whether the ESC has a fault as well as on the source and destination
addresses, but are readily computed.

First consider the fault-free case. For both routing and broadcast tags,
the n-th bit will be ignored since stage n is disabled when there are no faults.
The routing tag is given by $T' = t'_n t_{n-1} \ldots t_1 t_0$, where
$t_{n-1} \ldots t_1 t_0 = T$, the tag used in the Generalized Cube. The bit
$t'_n$ can be set to either 0 or 1. The bits of T' are interpreted in the same
way as tag bits in the Generalized Cube scheme. The broadcast routing tag is
composed of $R' = r'_n r_{n-1} \ldots r_1 r_0$ and
$B' = b'_n b_{n-1} \ldots b_1 b_0$, where $r_{n-1} \ldots r_1 r_0 = R$,
$b_{n-1} \ldots b_1 b_0 = B$, and $r'_n$ and $b'_n$ are arbitrary. Again, the
bits of {R',B'} have the same meaning as in the Generalized Cube.

Now routing tag and broadcast routing tag definitions for use in the ESC
with a fault will be described. With regard to routing tags, the primary path
in the ESC is that corresponding to the tag $T' = 0 t_{n-1} \ldots t_1 t_0$,
and the secondary path is that associated with
$T' = 1 t_{n-1} \ldots t_1 \bar{t}_0$. The primary broadcast path is
specified by $R' = 0 r_{n-1} \ldots r_1 r_0$ and
$B' = 0 b_{n-1} \ldots b_1 b_0$, whereas
$R' = 1 r_{n-1} \ldots r_1 \bar{r}_0$ and $B' = 0 b_{n-1} \ldots b_1 b_0$
denote the secondary broadcast path.

It is assumed that the system has appropriately reconfigured the network
and distributed fault labels to all sources as required. With the condition of
the primary path known, a routing tag that avoids the network fault can be
computed.

Theorem 4.7: For the ESC with one fault, any one-to-one connection
performable on the Generalized Cube with the routing tag T can be performed
using the routing tag T' obtained from the following rules.

1. If the fault is in stage 0, use $T' = t_0 t_{n-1} \ldots t_1 t'_0$, where
$t'_0$ is arbitrary.

2. If the fault is in a link or in a box in stages n-1 to 1 and the primary
path is fault-free, use $T' = 0 t_{n-1} \ldots t_1 t_0$. If the primary path
is faulty, use the secondary path $T' = 1 t_{n-1} \ldots t_1 \bar{t}_0$.

3. If the fault is in a stage n box, use $T' = t'_n t_{n-1} \ldots t_1 t_0$,
where $t'_n$ is arbitrary.

Proof: Assume that the fault is in stage 0, i.e., stage n is enabled and stage 0
disabled. Since stage n duplicates stage 0 (both perform $cube_0$), routing can
be accomplished by substituting stage n for stage 0. The tag
$T' = t_0 t_{n-1} \ldots t_1 t'_0$ does this by placing a copy of $t_0$ in the
n-th bit position. Stage n then performs $cube_0$ as necessary. Note that the
low-order bit position of T', $t'_0$, will be ignored since stage 0 is
disabled.

Assume the fault is in a link or in a box in stages n-1 to 1. T specifies
the primary path. If this path is fault-free, setting
$T' = 0 t_{n-1} \ldots t_1 t_0$ will use this path. The 0 in the n-th bit
position is necessary because stages n and 0 are enabled, given the assumed
fault location. If the path denoted by T contains the fault, then the
secondary path is fault-free by Theorem 4.2 and must be used. It is reached by
setting the high-order bit of T' to 1. This maps S to the input $cube_0(S)$ of
stage n-1. To complete the path to D, bits n-1 to 0 of T' must be
$cube_0(S) \oplus D = t_{n-1} \ldots t_1 \bar{t}_0$. Thus,
$T' = 1 t_{n-1} \ldots t_1 \bar{t}_0$.

Finally,  assume  the  fault  is  in  stage  n.  Stage  n  will  be  disabled,  and  the 
routing  tag  needed  will  be  the  same  as  in  the  fault-free  ESC. 

□ 
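
The rules of the theorem above can be sketched in Python. The function name `esc_routing_tag`, the string codes for the fault location, and the choice of 0 for every "arbitrary" bit are all assumptions made for illustration; whether the primary path is faulty is taken as an input rather than computed.

```python
def esc_routing_tag(t, n, fault="none", primary_faulty=False):
    """Return the (n+1)-bit ESC tag T' for Generalized Cube tag t.

    fault is one of the hypothetical codes:
      "none"     - fault-free network (stage n disabled)
      "stage0"   - fault in a stage 0 box
      "interior" - fault in a link, or in a box in stages n-1 to 1
      "stage_n"  - fault in a stage n box
    Arbitrary bits are set to 0 here.
    """
    if fault in ("none", "stage_n"):
        return t                          # high-order bit is arbitrary
    if fault == "stage0":
        t0 = t & 1
        return (t0 << n) | (t & ~1)       # copy t_0 into bit n; bit 0 is ignored
    if fault == "interior":
        if primary_faulty:
            return (1 << n) | (t ^ 1)     # secondary path: 1 t_{n-1}...t_1 not(t_0)
        return t                          # primary path: 0 t_{n-1}...t_1 t_0
    raise ValueError("unknown fault code: " + fault)


# N = 8 (n = 3), T = 110.
print(format(esc_routing_tag(0b110, 3, "interior", primary_faulty=True), "04b"))
```

With T = 110 and a faulty primary path, the sketch produces T' = 1111: the leading 1 selects the secondary path and the low-order tag bit is complemented.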

Recall  from  Section  4.3.3  that  the  procedure  for  determining  if  a  primary 
broadcast  path  is  faulty  may  result  in  unnecessary  use  of  the  secondary 
broadcast  path.  As  the  following  theorem  shows,  generating  broadcast  routing 
tags  to  use  the  secondary  broadcast  path  incurs  minimal  additional  overhead 
relative  to  primary  broadcast  path  tags.  Thus,  unnecessary  use  of  secondary 
broadcast  paths  is  of  little  consequence. 

Theorem 4.8: For the ESC with one fault, any broadcast performable on the
Generalized Cube with the broadcast routing tag {R,B} can be performed
using the broadcast routing tag {R',B'} obtained from the following rules.

1. If the fault is in stage 0, use $R' = r_0 r_{n-1} \ldots r_1 r'_0$ and
$B' = b_0 b_{n-1} \ldots b_1 b'_0$, where $r'_0$ and $b'_0$ are arbitrary.

2. If the fault is in a link or in a box in stages n-1 to 1 and the primary
broadcast path is fault-free, use $R' = 0 r_{n-1} \ldots r_1 r_0$ and
$B' = 0 b_{n-1} \ldots b_1 b_0$. If the secondary broadcast path has been
chosen, use $R' = 1 r_{n-1} \ldots r_1 \bar{r}_0$ and
$B' = 0 b_{n-1} \ldots b_1 b_0$.

3. If the fault is in a stage n box, use $R' = r'_n r_{n-1} \ldots r_1 r_0$ and
$B' = b'_n b_{n-1} \ldots b_1 b_0$, where $r'_n$ and $b'_n$ are arbitrary.


Proof: Assume that the fault is in a stage 0 box. As a direct consequence of
Lemma 4.5, any broadcast performable on the Generalized Cube using the
broadcast routing tag {R,B} is performable on the ESC with stage 0 disabled
and stage n enabled (i.e., stage 0 faulty). The broadcast routing tag substitutes
stage n for stage 0 by having $r_0$ and $b_0$ copied into $r'_n$ and $b'_n$,
respectively. This results in the same broadcast because the order in which the
interconnection functions are applied is immaterial for one-to-one routing and
broadcasting. Specifically, if $r_i = 0$ and $b_i = 0$, then in the set of
destination addresses, $d_i = s_i$; if $r_i = 1$ and $b_i = 0$, then
$d_i = \bar{s}_i$; and if $b_i = 1$, then $d_i$ can be 1 or 0. When $b_0 = 0$
and there is a fault in stage 0, if $r_0 = 0$, the primary broadcast path is
used, and if $r_0 = 1$, the secondary broadcast path is used. When $b_0 = 1$
and stage 0 is faulty, the stage n interchange box routing the message
performs a broadcast, and a combination of primary and secondary paths
connects the source to its destinations. Each address bit is affected
individually, making the order of stages irrelevant.

Assume the fault is in a link or a box in stages n-1 to 1. {R,B} specifies
the primary broadcast path. If it is fault-free, setting
$R' = 0 r_{n-1} \ldots r_1 r_0$ and $B' = 0 b_{n-1} \ldots b_1 b_0$ will use
this broadcast path. If the primary broadcast path contains the fault, then the
secondary broadcast path is fault-free as a consequence of Theorem 4.4.
Setting $R' = 1 r_{n-1} \ldots r_1 \bar{r}_0$ and
$B' = 0 b_{n-1} \ldots b_1 b_0$ causes the broadcast to be performed using the
secondary broadcast path.

Finally, assume the fault is in stage n. Stage n will be disabled, and the
broadcast routing tag needed will be the same as in the fault-free ESC.
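
The broadcast rules can be sketched the same way as the one-to-one rules. As before, the function name `esc_broadcast_tag` and the fault-location codes are hypothetical, every "arbitrary" bit is set to 0, and the decision to use the secondary broadcast path is supplied by the caller.

```python
def esc_broadcast_tag(r, b, n, fault="none", use_secondary=False):
    """Return (R', B'), each n+1 bits, for the Generalized Cube tag {R, B}.

    fault is "none", "stage0", "interior" (a link, or a box in stages
    n-1 to 1), or "stage_n". Arbitrary bits are set to 0.
    """
    if fault in ("none", "stage_n"):
        return r, b                               # bit n of each part is arbitrary
    if fault == "stage0":
        r_out = ((r & 1) << n) | (r & ~1)         # r_0 copied into bit n
        b_out = ((b & 1) << n) | (b & ~1)         # b_0 copied into bit n
        return r_out, b_out
    if fault == "interior":
        if use_secondary:
            return (1 << n) | (r ^ 1), b          # R' = 1 r_{n-1}...r_1 not(r_0)
        return r, b                               # R' = 0 r..., B' = 0 b...
    raise ValueError("unknown fault code: " + fault)


# {R,B} = {111, 101} from the earlier broadcast example, N = 8 (n = 3).
r_out, b_out = esc_broadcast_tag(0b111, 0b101, 3, "stage0")
print(format(r_out, "04b"), format(b_out, "04b"))
```

Only R' changes between the primary and secondary cases; B' keeps a 0 in bit n, so the extra work for choosing the secondary broadcast path is a single bit set plus one bit complement.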


Theorems  4.7  and  4.8  are  important  to  MIMD  operation  of  the  network 
because  they  show  that  the  fault  tolerant  capability  of  the  ESC  is  available 
through  simple  manipulation  of  the  usual  routing  or  broadcast  tags.  Table  4.2 
summarizes  routing  tags  and  Table  4.3  summarizes  broadcast  routing  tags  for 
the  ESC. 

In  the  case  of  multiple  faults  where  the  conditions  of  Theorem  4.6  are  met 
(i.e., there exists at least one fault-free path between any source and
destination),  routing  tag  utility  is  unchanged.  That  is,  each  source  checks  the 
primary  path  for  faults,  but  against  a  longer  list  of  fault  labels.  The  routing 
tag  is  still  formed  as  in  rule  2  of  Theorem  4.7. 

Broadcast tags can be used to determine if the primary or secondary
broadcast path of the broadcast specified by the tag contains a fault. To check
if a fault in stage i is in the primary broadcast path, the source constructs
$H = h_{n-1} \ldots h_1 h_0$ such that for $0 \le j < i$, $h_j = s_j$, and for
$i \le j \le n-1$, if $b_j = 1$ then $h_j = x$ (“don’t care”), otherwise
$h_j = s_j \oplus r_j$. If H matches a fault label (with “x” matching 0 or 1),
then the primary path contains a fault. Note that if $\bar{s}_0$ is used in
place of $s_0$, the secondary broadcast path can be checked. This test can be
used for both single faults and multiple faults (by repeating the test for
each fault label).
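
The matching test can be sketched in Python as below. The function name `broadcast_path_hits_fault` is hypothetical, a fault is assumed to be given as a (stage, label) pair with the stage in the range 1 to n-1, and the "don't care" positions are handled by skipping the comparison. The values R = 011 and B = 110 in the usage lines were computed here from the Generalized Cube tag rules for the Figure 4.6 broadcast (source 2 to outputs 1, 3, 5, and 7); they are not quoted from the text.

```python
def broadcast_path_hits_fault(source, r, b, n, fault, secondary=False):
    """Compare H against one fault label; fault = (stage, label), 1 <= stage <= n-1."""
    stage, label = fault
    s_low = source ^ 1 if secondary else source   # secondary path uses not(s_0)
    for j in range(n):
        if j < stage:
            h = (s_low >> j) & 1                  # h_j = s_j below the fault stage
        elif (b >> j) & 1:
            continue                              # h_j = x, matches 0 or 1
        else:
            h = ((source ^ r) >> j) & 1           # h_j = s_j XOR r_j
        if h != (label >> j) & 1:
            return False
    return True


# Broadcast of Figure 4.6: source 2, R = 011, B = 110, faults (2,2) and (1,5).
for fault in [(2, 2), (1, 5)]:
    print(fault,
          broadcast_path_hits_fault(0b010, 0b011, 0b110, 3, fault),
          broadcast_path_hits_fault(0b010, 0b011, 0b110, 3, fault, secondary=True))
```

Under these assumptions the test reports fault (2,2) on the primary broadcast path and fault (1,5) on the secondary broadcast path, which agrees with the description of Figures 4.6 and 4.7.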

4.6  Partitioning 

The partitionability of a network is the ability to divide the network into
independent subnetworks of different sizes [Sie80]. Each subnetwork of size
N' < N must have all the interconnection capabilities of a complete network of
that same type built to be of size N'. A partitionable network can be
characterized by any limitations on the way in which it can be subdivided.


Such a network allows an MSIMD, partitionable SIMD/MIMD, or MIMD
machine to be dynamically reconfigured into independent subsystems.


Table 4.2 One-to-one routing tags for the ESC. The symbol “x” represents
either 0 or 1.

Fault Location            Routing Tag T'
------------------------  ----------------------------------------------
No fault                  $T' = x\, t_{n-1} \ldots t_1 t_0$
Stage 0                   $T' = t_0\, t_{n-1} \ldots t_1 x$
Stage i box, 0 < i < n,   $T' = 0\, t_{n-1} \ldots t_1 t_0$
or any link                 if primary path is fault free;
                          $T' = 1\, t_{n-1} \ldots t_1 \bar{t}_0$
                            if primary path contains a fault
Stage n box               $T' = x\, t_{n-1} \ldots t_1 t_0$


Table 4.3 Broadcast routing tags for the ESC. The symbol “x” represents
either 0 or 1.

Fault Location            Broadcast Routing Tag {R', B'}
------------------------  ----------------------------------------------
No fault                  $R' = x\, r_{n-1} \ldots r_1 r_0$
                          $B' = x\, b_{n-1} \ldots b_1 b_0$
Stage 0                   $R' = r_0\, r_{n-1} \ldots r_1 x$
                          $B' = b_0\, b_{n-1} \ldots b_1 x$
Stage i box, 0 < i < n,   $R' = 0\, r_{n-1} \ldots r_1 r_0$
or any link               $B' = 0\, b_{n-1} \ldots b_1 b_0$
                            if primary broadcast path is fault free;
                          $R' = 1\, r_{n-1} \ldots r_1 \bar{r}_0$
                          $B' = 0\, b_{n-1} \ldots b_1 b_0$
                            if primary broadcast path contains a fault
Stage n box               $R' = x\, r_{n-1} \ldots r_1 r_0$
                          $B' = x\, b_{n-1} \ldots b_1 b_0$

The Generalized Cube can be partitioned into two subnetworks of size
N/2 by forcing all interchange boxes to the straight state in any one stage
[Sie80]. All the input and output port addresses of a subnetwork will agree in
the i-th bit position if the stage that is set to all straight is the i-th
stage. For example, Figure 4.11 shows how a Generalized Cube with N=8 can be
partitioned on stage 2, or the high-order bit position, into two subnetworks
with N' = 4. The two subnetworks are denoted by the A and B labels on the
interchange boxes. Since both subnetworks have all the properties of a
Generalized Cube, they can be further subdivided independently. This allows
the network to be partitioned into various network port groupings each with a
power of two ports. For example, a network of size N=64 could be
partitioned into five subnetworks with one each of sizes 32, 16, and 8, and
two of size 4.

The  ESC  can  be  partitioned  in  a  similar  manner,  with  the  property  that 
each  subnetwork  has  the  attributes  of  the  ESC,  including  fault  tolerance.  The 
only  constraint  is  that  the  partitioning  cannot  be  done  using  stage  n  or  stage  0. 

Theorem 4.9: The ESC can be partitioned with respect to any stage except
stages n and 0.

Proof: The cube functions $cube_{n-1}$ through $cube_1$ each occur once in the
ESC. Setting stage i, $1 \le i \le n-1$, to all straight separates the network
input and output ports into two independent groups. Each group contains ports
whose addresses agree in the i-th bit position, i.e., all addresses have their
i-th bits equal to 0 in one group, and 1 in the other. The other n stages
provide the $cube_j$ functions for $0 \le j < n$ and $j \ne i$, where $cube_0$
appears twice. This comprises an ESC network for the N/2 ports of each group.
As with the Generalized Cube, each subnetwork can be further subdivided. Since
the addresses of the interchange box outputs and links of a primary path and a
secondary path differ only in the 0th bit position, both paths will be in the
same partition (i.e., they will agree in the bit position(s) upon which the
partitioning is based). Thus, the fault-tolerant routing scheme of the ESC is
compatible with network partitioning.

If partitioning is attempted on stage n the result will clearly be a
Generalized Cube topology of size N. Attempting to partition on stage 0 again
yields a network of size N, in particular a Generalized Cube with $cube_0$
first, not last. In neither case are independent subnetworks formed.

□ 

In  Figure  4.12  an  ESC  with  N=8  is  shown  partitioned  with  respect  to 
stage  2.  The  two  subnetworks  are  indicated  by  the  labels  A  and  B. 
Subnetwork  A  consists  of  input  and  output  ports  0,  1,  2,  and  3.  These  port 
addresses  agree  in  the  high-order  bit  position  (it  is  0).  Subnetwork  B  contains 
input and output ports 4, 5, 6, and 7, all of which agree in the high-order bit
position  (it  is  1). 

Partitioning can be readily accomplished by combining routing tags with
masking. An (n + 1)-bit mask can be used to define which stages are to be used
to establish a partition. By logically ANDing tags with masks to force to 0
those tag positions corresponding to interchange boxes that should be set to
straight, partitions can be enforced. This process is external to the network
and, so, independent of a network fault. Thus, partitioning is unimpeded by a
fault.
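
The masking step can be sketched as follows. The helper names `partition_mask` and `apply_partition` are hypothetical, and the check that rejects stages n and 0 simply encodes the restriction stated in the partitioning theorem above.

```python
def partition_mask(n, straight_stages):
    """Build an (n+1)-bit mask with a 0 in each stage position forced to straight."""
    mask = (1 << (n + 1)) - 1                 # all n+1 tag positions enabled
    for i in straight_stages:
        if not 1 <= i <= n - 1:
            raise ValueError("the ESC cannot be partitioned on stage n or stage 0")
        mask &= ~(1 << i)                     # force tag bit i to 0 (straight)
    return mask


def apply_partition(tag, mask):
    """AND an (n+1)-bit routing tag with the partition mask."""
    return tag & mask


# N = 8 (n = 3), partition on stage 2 as in Figure 4.12.
mask = partition_mask(3, [2])
print(format(mask, "04b"), format(apply_partition(0b1101, mask), "04b"))
```

With the stage 2 position cleared, a tag can no longer request an exchange at stage 2, so a message cannot leave the half of the network in which it started. For broadcast tags it would seem natural to apply the same mask to both R' and B', since a 1 in B' also changes the box state at that stage; the text does not spell this out, so treat that as an assumption.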

In PASM partitioning is designed to be based on input/output port
addresses within a group agreeing in some number of low-order bit positions.
Figure 4.13 shows the Generalized Cube establishing a partition on the
low-order bit position. The two subnetworks are again indicated by the labels A
and B. The ESC as previously defined cannot support this type of partition.
However, a variation of the ESC can perform low-order bit partitioning. An
ESC-like network can be constructed by adding an extra stage to the output
side of the Generalized Cube network that implements $cube_{n-1}$. Call this
new stage -1. Thus, from the input to the output, the stages implement
$cube_{n-1}$, $cube_{n-2}$, ..., $cube_1$, $cube_0$, and $cube_{n-1}$. The same
fault-tolerant capabilities are available with this new network, but
partitioning may be done on stage 0. Hence, low-order bit partitioning is
available. Partitioning on stages n-1 and -1 is not available. Figure 4.14
illustrates this variation on the ESC. The two subnetworks formed by
partitioning on stage 0 are labeled A and B, as before.

4.7  Permuting 

In  SIMD  mode  generally  all  or  most  sources  will  be  sending  data 
simultaneously.  Sending  data  from  each  source  to  a  single,  distinct  destination 
is  referred  to  as  permuting  data  from  input  to  output.  A  network  can  perform 
or  pass  a  permutation  if  it  can  map  each  source  to  its  destination  without 
conflict.  Conflict  occurs  when  two  or  more  paths  include  the  same  output  of  a 
switching  element. 

The fault-free ESC clearly has the same permuting capability as the
Generalized Cube. That is, any permutation performable by the Generalized
Cube is performable by the ESC. If stage n in a fault-free ESC is enabled, the
permuting capability is a superset of that of the Generalized Cube. Also, the
ESC routing tags discussed in Section 4.5 are entirely suitable for use in an
SIMD environment.


Figure 4.13 Generalized Cube network with N=8 partitioned into two
subnetworks of size N' = 4 based on the low-order bit position.
The A and B labels denote the two subnetworks.


Because  of  its  fault-tolerant  nature,  it  is  possible  to  perform  permutations 
on  the  ESC  with  a  single  fault.  It  can  be  shown  that  in  this  situation  two 
passes  are  sufficient  to  realize  any  Generalized  Cube  performable  permutation. 

Theorem 4.10: In the ESC with one fault all Generalized Cube performable
permutations can be performed in at most two passes.

Proof: If a stage n interchange box is faulty, the stage is bypassed and the
remainder of the ESC performs any passable permutation with a single pass. If
the fault is in a stage 0 box the permutation can be accomplished in two passes
as follows. In the first pass, stages n and 0 are bypassed and the remaining
stages are set as usual. On the second pass, stage n is set as stage 0 would
have been, stages n-1 through 1 are set to straight, and stage 0 is again
bypassed. This simulates a pass through a fault-free network.

While stages n to 1 of the ESC provide the complete set of cube
interconnection functions found in the Generalized Cube, a single pass through
the stages in this order does not duplicate its permuting capability. For
example, the Generalized Cube can perform a permutation which includes the
mappings 0 to 0 and 1 to 2. Stages n to 1 of the ESC cannot do this. The
order of the stages is important. Thus, the two pass procedure given is
necessary.

When the fault is in a link or a box in stages n-1 to 1, then at the stage
containing the fault there are fewer than N paths through the network. Thus, N
paths cannot exist simultaneously. The permutation can be completed in two
passes in the following way. First, all sources with fault-free primary paths to
their destination are routed. One source will not be routed if the failure was in
a link, two if in a box. With a failed link, the second pass routes the
remaining source to its destination using its fault-free secondary path. With a
faulty box, the secondary paths of the two remaining sources will route to their
destinations without conflict. Recall that paths conflict when they include the
same box output, and the primary paths of a passable permutation do not
conflict. Thus, the stage i output labels of the two primary paths are distinct,
for 0 < i < n. The secondary path stage i output labels differ from the
primary path labels only by complementing the 0th bit position. Therefore, the
secondary path output labels are also distinct, for 0 < i < n. Hence, the
secondary paths do not conflict.

□ 


Permutation  passing  can  be  extended  naturally  to  the  multiple  fault 
situation. 

Corollary  4.S:  In  the  ESC  with  multiple  faults  but  retaining  fault-free 
interconnection  capability,  all  Generalized  Cube  performable  permutations  can 
be  performed  in  at  most  two  passes. 

Proof: For a performable permutation the primary paths between each
source/destination pair are by definition pairwise nonconflicting. From the
proof of Theorem 4.10, if two primary paths do not conflict then their two
associated secondary paths do not conflict. Thus, there is no conflict among
the secondary paths. Therefore, in the ESC with multiple faults but retaining
fault-free interconnection capability, a permutation can be performed by first
passing data over those primary paths which are fault-free and then passing
the remaining data using secondary paths.

For  multiple  faults  in  stage  n,  that  stage  is  disabled  and  permutations  are 
performed  in  one  pass.  With  multiple  faults  in  stage  0  the  same  procedure  for 
the  case  of  a  single  stage  0  fault  is  used,  performing  permutations  in  two 
passes. 

□ 

Figures  4.15  and  4.16  illustrate  how  the  permutation  that  sends  data  from 
every source x to destination (x + 2) modulo N, for $0 \le x < N$ and N = 8, is
performed  in  the  presence  of  two  faults  with  fault  labels  (2,2)  and  (1,4).  In 
Figure  4.15  the  fault-free  primary  paths  that  comprise  the  first  pass  are  shown. 
The  remaining  fault-free  secondary  paths  used  in  the  second  pass  are  indicated 
in  Figure  4.16. 
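
The two-pass procedure for this example can be sketched in Python. Several assumptions are made here and are not taken from the text: each fault label (i, L) is treated as a single faulty stage i output with label L (a faulty box would contribute both of its output labels), the stage i output label of a path is formed from the destination bits n-1 to i and the source bits i-1 to 0 as in the broadcast path form used earlier, and the function names are hypothetical. The sketch does not check that the permutation is passable by the Generalized Cube.

```python
def stage_output_label(src, dst, stage, secondary=False):
    """Stage-i output label: destination bits n-1..i, source bits i-1..0.

    The secondary path differs only in the low-order bit, which is complemented.
    """
    s = src ^ 1 if secondary else src
    low_mask = (1 << stage) - 1
    return (dst & ~low_mask) | (s & low_mask)


def path_is_faulty(src, dst, n, faults, secondary=False):
    """True if the path meets any fault (stage, label) with 1 <= stage <= n-1."""
    return any(1 <= stage <= n - 1
               and stage_output_label(src, dst, stage, secondary) == label
               for stage, label in faults)


def two_pass_schedule(perm, n, faults):
    """Pass 1: sources with fault-free primary paths. Pass 2: the remainder."""
    first = [s for s, d in enumerate(perm) if not path_is_faulty(s, d, n, faults)]
    second = [s for s, d in enumerate(perm) if path_is_faulty(s, d, n, faults)]
    return first, second


# x -> (x + 2) mod 8 with fault labels (2,2) and (1,4).
n, faults = 3, [(2, 2), (1, 4)]
perm = [(x + 2) % 8 for x in range(8)]
first_pass, second_pass = two_pass_schedule(perm, n, faults)
print(first_pass, second_pass)
```

Under these assumptions, sources 2 and 6 are deferred to the second pass, and neither of their secondary paths meets a fault label.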

4.8  Conclusions 

This chapter has presented the Extra Stage Cube interconnection network,
a derivative of the Generalized Cube network. The ESC was shown to be
single-fault tolerant. Under multiple faults the ESC was shown to be robust,
often retaining fault-free interconnection capability. A minor adaptation of the
“exclusive-or” routing tag and broadcast routing tag schemes designed for the
Generalized Cube was described. This allows the use of tags to control a
faulted, as well as fault-free, ESC. The partitioning and permuting abilities of
the ESC were discussed.


Figure 4.15 ESC network with N = 8 and faults (2,2) and (1,4) (indicated by
broken lines) showing all fault-free primary paths for the
permutation mapping input x to output (x + 2) modulo N, for
$0 \le x < N$.


The  reliability  of  large-scale  multiprocessor  systems  is  a  function  of  system 
structure  and  the  fault  tolerance  of  system  components.  Fault-tolerant 
interconnection  networks  can  aid  in  achieving  satisfactory  system  reliability. 
The  ESC  seems  to  be  a  practical  and  useful  interconnection  network  for  such 
systems. 

The  family  of  multistage  interconnection  networks  of  which  the 
Generalized  Cube  is  representative  has  received  much  attention  in  the 
literature.  Because  of  its  relationship  to  the  Generalized  Cube  network  (which 
is  representative  of  many  multistage  interconnection  networks),  realistic  fault 
model  and  fault-tolerance  criterion,  and  ease  of  routing  in  the  presence  of 
faults,  the  ESC  network  may  also  be  useful  as  a  standard  for  evaluating 
existing and future fault-tolerant networks.


CHAPTER  6 

SURVEY  OF  FAULT-TOLERANT  MULTISTAGE 
INTERCONNECTION  NETWORKS  AND 
COMPARISON  TO  THE  EXTRA  STAGE  CUBE 

6.1 Introduction

Many  interconnection  networks  have  been  proposed  in  the  literature.  Of 
these,  multistage  networks  compose  a  large  subset.  Of  multistage  networks,  a 
somewhat  smaller  group  have  an  ability  to  tolerate  at  least  some  component 
failures. In this latter group are those multistage interconnection networks for
which  the  topology,  in  combination  with  switching  element  structure,  has  been 
designed  specifically  to  achieve  some  particular  fault-tolerance  capability 
(typically,  single-fault  tolerance).  It  is  these  interconnection  networks  on  which 
this  chapter  is  focused. 

There are a number of ways to realize a fault-tolerant interconnection
network that do not involve design of the network per se. One way is to use
error correcting codes in the data and control paths of a network not otherwise
fault tolerant [LLY82]. Another scheme is to implement the network in several
independent bit-slices; failure in one or more slices can be tolerated by using
the remaining good slices (e.g., [SiM81b]). Yet another method is to replicate
an existing network and utilize the copies in parallel, perhaps cross-linking
them, so that a failure in the set does not compromise the functionality of the
whole (e.g., [KrM83, ReK84]). Often these techniques, and others, are used in
conjunction with a basic fault-tolerant design.

Multistage interconnection networks for parallel processing designed
specifically for fault tolerance have become an active area of research recently,
and survey papers are few [AgK83, AdS84a]. This chapter presents the
Modified Baseline network [WFL82], the Augmented Delta network [DiJ82], the
Multipath Omega network [PaL83a], the F-network [CiS82], the Enhanced
Inverse Augmented Data Manipulator [McS82a], the Gamma network [PaR82,
PaR84], the Fault-Tolerant Beneš network [Agr82, SoR80], the Augmented
C-network [ReK84], and β-networks [ShH84]. The papers introducing all but two
of these networks (the Fault-Tolerant Beneš network and β-networks) appeared
in the literature subsequent to papers describing the ESC [AdS82a, AdS82c,
AdS82d]. For this reason, this chapter appears after the development of ESC
properties contained in Chapter 4.

Following the survey, these networks are compared to the ESC on the
basis of fault models and fault-tolerance criteria. Then a common
fault-tolerance model is assumed and the fault tolerance of each network is evaluated
in  that  circumstance.  The  ESC  fault-tolerance  model  is  chosen  as  the  common 
model  for  this  phase  of  the  comparison  because  it  is  the  most  demanding  of 
those  of  the  various  networks.  The  intent  of  this  chapter  is  to  provide  a 
context  in  which  to  view  the  merits  of  the  ESC  fault-tolerant  multistage 
interconnection  network  investigated  in  this  research. 


The  surveyed  networks  fall  into  four  general  categories.  The  Modified 
Baseline,  Augmented  Delta,  Multipath  Omega,  and  F-network  form  a  group 
based on the Generalized Cube network [SiM81b, SiS78]. The first two of these
achieve  fault  tolerance  by  adding  an  extra  stage  of  switches  to  a  basic  network 
which  is  isomorphic  to  the  Generalized  Cube  topology.  The  Multipath  Omega 
uses  an  extra  stage  or  stages  of  switches,  or  substitutes  different  switches  in  a 
stage  or  stages,  or  some  combination  of  these  methods,  to  add  fault  tolerance 
to the Omega network [Law75], which is also isomorphic to the Generalized
Cube  topology.  The  F-network  gains  fault  tolerance  by  using  a  Generalized 
Cube  network  structure  with  additional  links. 

The Enhanced Inverse Augmented Data Manipulator (Enhanced IADM)
and Gamma networks represent a second group: the data manipulator [Fen74]
class of networks. The Enhanced IADM network uses additional links, and the
Gamma network uses increased switching element complexity, to realize fault
tolerance.

The Fault-Tolerant Benes network is a third type of network. It uses
2n-1 stages of switching elements with one additional switching element to
provide fault tolerance, compared to the n stages of switches in a Generalized
Cube, where $N = 2^n$ is the number of inputs.

The Augmented C-network and β-networks fall into a fourth network
category. Each is actually a family of networks, spanning a wide range of
topologies. The topology of Augmented C-networks is inherently fault
tolerant, while β-networks combine topology and an operational technique to
achieve fault tolerance.


5.2.1  Modified  Baseline  Network 


The Modified Baseline network [WFL82] is derived from the Baseline
multistage interconnection network [WuF80]. The Baseline topology consists,
in general, of a stage of N/r r-input/t-output ($r \times t$) crossbar switching
elements connected to t subnetworks numbered 0 to t-1. This yields a
network with N inputs and N outputs. If the t outputs of each $r \times t$ switch are
numbered 0 to t-1, and the switches are numbered 0 to N/r - 1, then output i
of switch j is connected to input j of subnetwork i, for $0 \le i \le t-1$ and
$0 \le j \le N/r - 1$. Each subnetwork is structured in the same manner as the
entire network. This recursive process is carried out until the subnetworks are
$r \times t$ switches, which can be implemented directly.

The Baseline network has but one path between any source and
destination. Thus, any network component failure affects communication for
some set of inputs and outputs. To lessen this difficulty an extra stage of
switching elements is added to the Baseline network. Figure 5.1 shows the
Modified Baseline network for r = t = 2 and N = 8, and indicates the original
Baseline network and the additional stage. If an extra stage incorporating
switching elements with t outputs is added at the input side of the network,
then there are t connection paths for any input/output pair.

The Modified Baseline network fault model states that the only network
components that can fail are switching elements not in the input or output
stages. Faulty switches are considered unusable. The fault-tolerance criterion
is retention of full access [CiS82]. Recall that full access is the ability to
connect any given input to any output. The Modified Baseline network is
single-fault tolerant and robust in the presence of multiple faults with respect
to its fault-tolerance model and fault-tolerance criterion.

Routing in the Baseline network is carried out using destination tags
[Law75] that consist of the address of the intended destination of a message. If
$r \times r$ switches are used (r = 2 for Figure 5.1) then a destination address D can
be represented by a base-r number $d_{m-1} \ldots d_1 d_0$, where $m = \log_r N$. This base-r
representation is used to select a path through the network in the following
way. The switching element connected to the source will use its output
numbered $d_{m-1}$ to link to a switching element in the next stage. At stage i, $d_i$
is used to determine the selection of switch output, $0 \le i \le m-1$. For the
Modified Baseline network an extra digit can be appended to Baseline network
destination tags to control the extra stage. This assumes that sufficient
information is available to determine the value of this extra digit to avoid any
existing fault [WFL82].
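
The destination-tag procedure can be illustrated with a short sketch. It is not taken from [WFL82] or [Law75]; the function names `base_r_digits` and `route_baseline` are hypothetical, and the sketch models only the recursive Baseline topology with r x r switches (no extra stage and no faults), following the rule that output i of switch j feeds input j of subnetwork i.

```python
def base_r_digits(value, r, m):
    """Return [d_{m-1}, ..., d_1, d_0], the m base-r digits of value."""
    digits = []
    for _ in range(m):
        digits.append(value % r)
        value //= r
    return digits[::-1]


def route_baseline(source, dest, r, n_ports):
    """Follow a destination tag through a Baseline network of r x r switches.

    Returns the network output reached and the (switch, output) pair used at
    each stage. Illustrative model only; names are assumptions of this sketch.
    """
    m, size = 0, 1
    while size < n_ports:
        size *= r
        m += 1
    if size != n_ports:
        raise ValueError("n_ports must be a power of r")

    position = source   # input index within the current (sub)network
    offset = 0          # first network output served by the current (sub)network
    hops = []
    for digit in base_r_digits(dest, r, m):   # most significant digit first
        switch = position // r                # switch j holding this input
        hops.append((switch, digit))
        size //= r                            # each subnetwork has size/r ports
        offset += digit * size                # enter subnetwork number `digit`
        position = switch                     # output i of switch j -> input j
    return offset, hops


if __name__ == "__main__":
    output, hops = route_baseline(source=3, dest=5, r=2, n_ports=8)
    print(output, hops)
```

Because each tag digit selects one subnetwork, the output reached depends only on the destination address and not on the source, which is why a single tag format serves every input.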

5.2.2  Augmented  Delta  Network 

Delta, or digit controlled, networks [Pat81] are a class of multistage
interconnection networks and also a subset of a very broad class known as
banyan networks [GoL73]. A delta network is, in general, an $a^n \times b^n$ network
with n stages, each consisting of $a \times b$ crossbar switching elements. The link
pattern between stages provides a unique path of fixed length between any
network input and output. Further, the link pattern is such that information
can be routed from an input to an output using the switching element output
corresponding to a base-b digit in the base-b representation of the output
number.

An Augmented Delta network [DiJ82] can be constructed from a Delta
network, as illustrated by Figure 5.2. Let $n = \log_b N$. The Augmented Delta
network, shown in Figure 5.2, consists of a stage of N/b $b \times b$ switching
elements connected to b delta networks, each of size $b^{n-1} \times b^{n-1}$. These b
networks are labeled $D_0$ through $D_{b-1}$ in the figure and each is structured in
the same way as the $b^n \times b^n$ network. Adding a stage of N/b $b \times b$ switching
elements to this, as shown in the figure, results in a topology with b paths
between any pair of network input/output ports, given the switching elements
are $b \times b$ devices. Additional redundant paths can be provided by adding more
stages. The Augmented Delta network is similar to the Modified Baseline
network; the distinction between the two is that the definition of the
Augmented Delta network allows more than one extra stage and the switches
of any extra stage(s) are identical to all others in the network. So that this
network will be comparable with the other networks discussed in this chapter,
only one additional stage is assumed.

The Augmented Delta network fault model is the following.

1.  Both  switching  elements  and  links  can  fail. 

2.  Stage  0  switching  elements  and  stage  n  nodes  are  always  fault-free. 

3.  Faults  occur  independently. 

4.  Faulty links or switching elements are unusable.

The fault-tolerance criterion is retention of full access capability. Under this
fault-tolerance model, an Augmented Delta network constructed from $2 \times 2$
switches is single-fault tolerant. If $b \times b$ switching elements are used
throughout, the network is (b-1)-fault tolerant.

Consider  packet-switched  operation  of  the  network.  Let  throughput  be  the 
average  rate  at  which  packets  exit  the  network.  Normalized  throughput  is  the 


ratio  of  throughput  obtained  to  maximum  throughput  possible  if  no  conflicts 
occurred  in  the  network.  When  fault-free,  an  Augmented  Delta  network  has  a 
slightly  lower  normalized  throughput  than  a  comparable  Delta  network  [DiJ82]. 
For  the  worst  case  single  switching  element  or  link  fault  in  an  Augmented 
Delta,  performance  falls  to  about  half  that  of  the  fault-free  case.  On  the 
average,  however,  performance  is  only  slightly  degraded. 

Routing in the Augmented Delta network is performed using routing tags
with one additional digit (as compared to Delta network routing tags [Pat81])
to control the extra stage. A switching element is assumed able to detect a
fault in the switches and links to which it is directly connected and able to pass
fault information to adjacent switches [Dia81]. When a fault occurs,
neighboring  switches  propagate  notice  of  the  fault  to  their  neighbors,  and  so 
on.  For  a  bidirectional  network,  propagation  of  information  proceeds  to  both 
the  “inputs”  and  “outputs.”  Eventually,  network  input  and/or  output  switches 
will  receive  information  indicating  which  of  their  outputs  or  inputs, 
respectively,  is  part  of  a  path  through  the  network  leading  to  the  fault. 
Routing  tags  are  subsequently  formed  so  as  to  never  use  this  path.  Note  that 
output  switches  need  not  be  informed  of  the  fault  in  a  unidirectional  network. 

5.2.3 Multipath Omega Network

The Multipath Omega network [PaL83a, PaL83b, Pad84] is derived from
the Omega multistage interconnection network [Law75]. A $B^m \times B^m$ Omega
network consists of m stages of $B \times B$ crossbar switching elements linked by
$(B * B^{m-1})$-shuffle interconnections. An X*Y-shuffle permutes X*Y elements.
Let $0 \le I \le X*Y - 1$, where I is represented as the x + y bit binary number
$I = i_{x+y-1} \ldots i_1 i_0$, $x = \log_2 X$, and $y = \log_2 Y$. Then

$$ X*Y\text{-shuffle}(I) = i_{y-1} i_{y-2} \ldots i_1 i_0 \, i_{x+y-1} \ldots i_{y+1} i_y , $$

i.e., I is rotated left by x bit positions. Figure 5.3 shows an Omega network
for B = 2, N = 8, and m = 3.
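
The rotation that defines the X*Y-shuffle can be checked with a small sketch. The function name `xy_shuffle` is an assumption of this example; it takes x and y (the base-2 logarithms of X and Y) directly and rotates the (x + y)-bit index left by x positions, as stated above.

```python
def xy_shuffle(index, x_bits, y_bits):
    """Rotate the (x_bits + y_bits)-bit number `index` left by x_bits positions."""
    width = x_bits + y_bits
    if not 0 <= index < (1 << width):
        raise ValueError("index out of range for an X*Y-shuffle")
    high = index >> y_bits                # top x bits: i_{x+y-1} ... i_y
    low = index & ((1 << y_bits) - 1)     # bottom y bits: i_{y-1} ... i_0
    return (low << x_bits) | high


if __name__ == "__main__":
    # 2*4-shuffle of eight elements (x = 1, y = 2), the shuffle used between
    # stages of an Omega network with B = 2 and N = 8.
    print([xy_shuffle(i, 1, 2) for i in range(8)])
```

For X = 2 the result is the familiar perfect shuffle, in which index I moves to position 2I modulo (N - 1) for I < N - 1.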

An  Omega  network  has  only  one  path  between  any  input  and  any  output; 
it  is  not  single-fault  tolerant.  To  overcome  this  difficulty,  Omega  networks 
with  multiple  paths  were  proposed.  For  the  Multipath  Omega  network  the 
fault-tolerance  criterion  is  full  access.  Its  fault  model  is  the  following. 

1.  Both switching elements and links can fail.

2.  Input  and  output  stage  switching  elements  are  always  fault-free. 

3.  Faults  occur  independently. 

4.  Faulty  components  are  unusable. 


Figure 5.4(a) shows a Multipath Omega network for N = 16. Its structure
is described by the pseudofactorization <4,2,2,4> of N. A pseudofactorization
of N is an f-tuple $\langle B_1, B_2, \ldots, B_f \rangle$ of integers with $B_1 * B_2 * \ldots * B_f = B$, such
that B is a multiple of N. Let B/N = R; then an R-path Multipath Omega
network corresponding to the pseudofactorization has f stages with stage i
consisting of $B_i \times B_i$ crossbar switches. Links entering stage i switches
implement a shuffle interconnection parameterized by an integer $k_i \ge 1$,
chosen so that there are exactly R ways to connect any network input
to any network output. For the network of Figure 5.4(a), $k_1 = 1$, $k_2 = 1$,
$k_3 = 8$, and $k_4 = 1$. R is known as the redundancy of the Multipath Omega
network. For the network of Figure 5.4(a), B/N = (4*2*2*4)/16 = 4, so the
redundancy of this network is four. This can be illustrated by a redundancy
graph, which shows the topology of the redundant paths.

Redundancy  graphs  are  directed  graphs  with  the  following  properties. 

1.  Graph  nodes  form  S  classes  corresponding  to  the  S  stages  of  switches  in 
the  network. 

2.  Each edge in the graph connects a node in class i to one in class i + 1,
$1 \le i < S$.

3.  All nodes of a class share the same in-degree and out-degree (numbers
of  entering  and  exiting  edges). 

The  redundancy  graph  of  a  network  is  a  subgraph  of  that  network 
corresponding  to  all  paths  between  any  given  input  and  any  given  output. 
Figure  5.4(b)  shows  the  redundancy  graph  for  the  network  of  Figure  5.4(a). 
Figure  5.5  depicts  other  possible  redundancy  graphs,  each  relating  to  a 
different  network  structure. 

The  line  connectivity  of  a  redundancy  graph  is  the  number  of  distinct 
paths  from  the  node  in  the  class  corresponding  to  the  input  stage  of  the 
network  to  the  node  corresponding  to  the  output  stage,  i.e.,  the  number  of 
distinct paths between any input and any output. The redundancy graph of
Figure 5.5(a) has a line connectivity of four; that of Figure 5.5(b), two. The
graph of Figure 5.5(c) has a line connectivity of one; it is the redundancy
graph for an Omega network with four stages of switches. A Multipath Omega
network having a redundancy graph with line connectivity X is (X-1)-fault
tolerant, so a network with the redundancy graph of Figure 5.5(a) is
three-fault tolerant.
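
Line connectivity, as defined above, is a path count on a layered directed graph and can be computed class by class. The sketch below is illustrative only: the function name `line_connectivity` and both example graphs are hypothetical and do not reproduce Figure 5.4(b) or Figure 5.5.

```python
def line_connectivity(classes, edges):
    """Count distinct paths from the first-class node to the last-class node.

    classes: list of lists of node names, ordered from input stage to output
             stage; the first and last classes hold a single node each.
    edges:   set of (u, v) pairs, each joining class i to class i + 1.
    """
    paths = {node: 0 for cls in classes for node in cls}
    paths[classes[0][0]] = 1
    for current, following in zip(classes, classes[1:]):
        for u in current:
            for v in following:
                if (u, v) in edges:
                    paths[v] += paths[u]   # every path reaching u extends to v
    return paths[classes[-1][0]]


if __name__ == "__main__":
    # Hypothetical four-class graph with two disjoint interior paths.
    classes = [["in"], ["a", "b"], ["c", "d"], ["out"]]
    edges = {("in", "a"), ("in", "b"), ("a", "c"),
             ("b", "d"), ("c", "out"), ("d", "out")}
    x = line_connectivity(classes, edges)
    print(f"line connectivity {x}, so ({x} - 1)-fault tolerant")
```

A simple chain of four nodes gives a count of one, which corresponds to the unique-path case described for the Omega network.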


Routing in the Multipath Omega network is controlled by routing tags
and is similar to the procedure used for the Omega network [Law75]. Stage i
switch settings are controlled by $b_i = \log_2 B_i$ bits. A routing tag consists of
$\sum_{i=1}^{f} b_i$ bits, where this sum is n + r, $r = \log_2 R$. The destination address
determines n bits of the tag; the remaining r bits select a particular path out of
the R alternatives. These r bits are termed redundant bits.
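
As a worked check of these quantities, the following sketch computes the redundancy R = B/N and the tag layout for a pseudofactorization. The function names are hypothetical, and the sketch assumes every $B_i$, N, and R is a power of two so that the logarithms are integers.

```python
from math import prod


def log2_exact(value):
    """Return log2(value) for a positive power of two, else raise ValueError."""
    if value <= 0 or value & (value - 1):
        raise ValueError(f"{value} is not a power of two")
    return value.bit_length() - 1


def multipath_omega_tag_layout(pseudofactorization, n_ports):
    """Return redundancy and routing-tag bit counts for <B_1, ..., B_f>."""
    b_product = prod(pseudofactorization)
    if b_product % n_ports:
        raise ValueError("the product of the B_i must be a multiple of N")
    redundancy = b_product // n_ports
    return {
        "redundancy": redundancy,                                   # R = B/N
        "bits_per_stage": [log2_exact(b) for b in pseudofactorization],
        "destination_bits": log2_exact(n_ports),                    # n
        "redundant_bits": log2_exact(redundancy),                   # r
    }


if __name__ == "__main__":
    print(multipath_omega_tag_layout((4, 2, 2, 4), 16))
```

For the pseudofactorization <4,2,2,4> of N = 16 the per-stage bit counts are 2, 1, 1, and 2, which sum to 6 = n + r with n = 4 and r = 2.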

Routing  tags  that  specify  a  fault-free  path  are  generated  by  one  of  three 
possible methods described in [PaL83a]. In non-adaptive routing a source
learns  of  a  fault  only  when  the  path  it  is  attempting  to  establish  reaches  the 
faulty  network  element.  Notice  is  sent  back  to  the  source,  which  tries  the  next 
alternative  path  (the  redundant  bits  in  the  routing  tag  are  changed  so  as  to 
increment  the  binary  number  that  can  be  formed  by  concatenating  these  bits). 
This  approach  requires  little  hardware  but  may  have  poor  performance.  Two 
forms  of  adaptive  routing  are  proposed.  With  notification  on  demand  a  source 
maintains  a  table  of  faults  it  has  encountered  and  uses  this  information  to 
guide  future  routing.  With  broadcast  notification  of  a  fault,  all  sources  that 
can  use  the  faulty  component  in  a  path  are  notified  and  keep  a  table,  as  with 
notification  on  demand. 
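
The non-adaptive scheme amounts to stepping through the values of the redundant bits until a fault-free path is found. The sketch below is a minimal model under stated assumptions: `path_is_fault_free` is a hypothetical callback standing in for the path-establishment attempt and the notice returned to the source, and the search is assumed to start from redundant-bit value 0.

```python
def nonadaptive_route(r_bits, path_is_fault_free):
    """Try redundant-bit values 0, 1, 2, ... until one names a fault-free path.

    Returns (value, attempts) on success or (None, attempts) if all
    R = 2**r_bits alternatives are blocked.
    """
    alternatives = 1 << r_bits
    for attempts, value in enumerate(range(alternatives), start=1):
        if path_is_fault_free(value):   # hypothetical stand-in for a path setup
            return value, attempts
    return None, alternatives


if __name__ == "__main__":
    # Hypothetical situation: of R = 4 alternatives only path 2 is fault-free.
    print(nonadaptive_route(2, lambda value: value == 2))
```

The attempt count makes the performance concern visible: each blocked alternative costs one failed path setup before the source moves on.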

5.2.4  F-Network 

The F-network [CiS82] connects $N = 2^n$ inputs to N outputs via n + 1
stages of N switching elements which are, in general, 4-input/4-output devices
that connect one input to one output. A switching element in stage j, $P_j$, is
denoted by a bit string $P_j = p_{n-1} \ldots p_1 p_0$. It connects to the stage j + 1
switching elements

$$ P_{j+1} = p_{n-1} \ldots p_1 p_0 , $$
$$ Q_{j+1} = p_{n-1} \ldots p_{j+1} \bar{p}_j p_{j-1} \ldots p_1 p_0 , $$
$$ R_{j+1} = \bar{p}_{n-1} \ldots \bar{p}_{j+1} \bar{p}_j p_{j-1} \ldots p_1 p_0 , \text{ and} $$
$$ S_{j+1} = \bar{p}_{n-1} \ldots \bar{p}_{j+1} p_j p_{j-1} \ldots p_1 p_0 . $$

Figure 5.6 shows the F-network for N = 8. Stages are numbered from left to
right ranging from 0 to n, and within each stage, switching elements are
numbered from 0 to N-1. The F-network contains the structure of the
Generalized Cube network and can emulate it using only the $P_{j+1}$ and $Q_{j+1}$
connections. Thus, the fault tolerance approach of the F-network is to add
the $R_{j+1}$ and $S_{j+1}$ links to the Generalized Cube structure, unlike the approach
of the Modified Baseline and Augmented Delta networks.

The F-network fault model assumes the following [CiS82].

1.  Only switching elements fail.

2.  Stage 0 and n switching elements are always fault-free.

3.  Faults occur independently.

4.  A  faulty  switching  element  is  unusable. 

The  fault-tolerance  criterion  for  the  F-network  is  retention  of  full  access.  The 
network  is  single  fault  tolerant  and  robust  in  the  presence  of  multiple  faults 
[CiS82] with respect to its fault-tolerance model. An expression for network
mean time before failure (MTBF) is derived in [CiS82].
Routing in the F-network is accomplished with routing tags defined as
follows [CiS82]. Let the source have address S and the destination address D.
Define the routing tag $C = c_{n-1} \ldots c_1 c_0 = S \oplus D$, where $\oplus$ is the bitwise
exclusive-or of the binary representations of S and D. Let r be a binary
variable initially set to 0, and let $r_j$ be the value of r after j steps of the routing
algorithm in Figure 5.7.

At  each  step  of  the  algorithm  it  is  possible  to  calculate  the  choice  of  next 
stage  switch  in  two  ways.  Two  different  switches  can  be  selected  for  all  stages 
except  n,  the  output  stage,  by  changing  the  value  assigned  to  x.  The  fault 
tolerance  of  the  F-network  arises  from  this  ability  to  choose  two  switching 
elements  in  the  next  stage  to  continue  a  path.  A  faulty  or  busy  switch  can  be 
avoided  by  taking  the  appropriate  path.  This  is  done  by  changing  the  value  of 
x upon detecting that the selected next switch is faulty or busy and
recalculating  the  choice  of  next  switch  based  on  the  new  value  of  x.  Because 
all  paths  from  a  given  source  go  through  the  same  stage  0  switch,  and  all  paths 
to  a  given  destination  go  through  the  same  stage  n  switch,  only  interior  stage 
switching  element  faults  (not  stage  n  or  0)  can  be  tolerated. 
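
The two-way choice at each stage can be sketched in code. The sketch rests on an assumed reading of the link definitions, since the complement marks are not legible in every copy of the text: the P link leaves the switch address unchanged, the Q link complements bit j, the S link complements bits n-1 through j+1, and the R link complements bits n-1 through j. It further assumes that r is left unchanged when x = 0 and complemented when x = 1. The function name `f_network_route` and the `choices` argument are hypothetical.

```python
def f_network_route(source, dest, n, choices):
    """Route from source to dest, taking choices[j] (0 or 1) as x at step j.

    Returns the list of switch addresses visited in stages 0 through n.
    Link semantics are an assumption of this sketch (see the text above).
    """
    mask = (1 << n) - 1
    tag = source ^ dest                  # C = S xor D
    r = 0
    current = source
    path = [current]
    for j in range(n):
        c_j = (tag >> j) & 1
        change_bit_j = c_j ^ r           # 1 if bit j must be complemented now
        upper = mask & ~((1 << (j + 1)) - 1)   # bits n-1 ... j+1
        if choices[j] == 0:
            current ^= change_bit_j << j             # P (0) or Q (1)
        else:
            current ^= upper | (change_bit_j << j)   # S (0) or R (1)
            r ^= 1                       # upper bits are now inverted again
        path.append(current)
    return path


if __name__ == "__main__":
    # Same source and destination, two different interior paths for N = 8.
    print(f_network_route(0, 5, 3, (0, 0, 0)))
    print(f_network_route(0, 5, 3, (1, 0, 0)))
```

Under these assumptions every sequence of x values ends at the destination, and at step n-1 both values of x select the same switch, which matches the statement that only interior stage faults can be avoided.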

5.2.5 Enhanced Inverse Augmented Data Manipulator Network

The Enhanced Inverse Augmented Data Manipulator network is derived
from the Inverse Augmented Data Manipulator (IADM). The IADM network is
an Augmented Data Manipulator (ADM) network [SiS78] with the order of
stage traversal reversed. The ADM is derived from the data manipulator
network [Fen74]. Figure 5.8 shows the IADM for N = 8. It consists of
$n = \log_2 N$ stages, where each stage consists of N switching elements and 3N
links that are connected to the succeeding stage. Each switching element


for j = 0 to n - 1 begin
    assign x to be either 0 or 1;
    if x = 0 then begin
        if c_j ⊕ r_j = 0 then
            choose stage j + 1 switch P_{j+1}
        else choose stage j + 1 switch Q_{j+1};
        r_{j+1} = r_j;
    end
    else begin
        if 2 + (c_j ⊕ r_j) = 2 then
            choose stage j + 1 switch S_{j+1}
        else choose stage j + 1 switch R_{j+1};
        r_{j+1} = complement of r_j;
    end;
end;


Figure 5.7 Routing algorithm for the F-network


connects one of its three inputs to one of its three outputs. At stage i,
$0 \le i < n$, the first output of switching element j, $0 \le j < N$, is connected to an
input of switching element $(j - 2^i)$ modulo N in stage i + 1. The second output
is connected to an input of switching element j in stage i + 1. Finally, the
third output is connected to an input of switching element $(j + 2^i)$ modulo N in
stage i + 1. These links are known as the minus, straight, and plus links,
respectively. Since $(j - 2^{n-1})$ is congruent to $(j + 2^{n-1})$ modulo N, there are
actually just two distinct logical data paths from each switching element in
stage n - 1 (stage 2 in Figure 5.8). There is an additional set of N nodes at the
output stage to receive data.
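
The link pattern can be tabulated directly from these definitions. In the sketch below the function name `iadm_successors` is an assumption of this example; it returns the stage i + 1 switching elements reached from element j by the minus, straight, and plus links.

```python
def iadm_successors(j, stage, n):
    """Return the stage + 1 elements reached from element j of `stage`."""
    n_ports = 1 << n
    step = 1 << stage                      # 2^i
    return {
        "minus": (j - step) % n_ports,
        "straight": j,
        "plus": (j + step) % n_ports,
    }


if __name__ == "__main__":
    for stage in range(3):                 # N = 8, n = 3
        print(stage, iadm_successors(0, stage, 3))
```

For N = 8 the minus and plus links of stage 2 both lead to element (j + 4) modulo 8, which illustrates why the last stage offers only two distinct logical paths.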

Performance and fault-tolerance improvements for the IADM are discussed
in [McS82a]. The resulting network is called the Enhanced IADM. Its fault
model is the same as for the Augmented Delta network, as is the
fault-tolerance criterion. Note that the IADM is a robust network with respect to
this model and criterion, but it is not single-fault tolerant.

One method of providing fault tolerance for the IADM is adding
redundant  straight  links.  A  faulty  straight  link  can  be  avoided  by  using  the 
second  straight  link.  Faulty  plus  or  minus  links  can  be  avoided  by  taking  the 
alternative  path  available  at  the  stage  just  prior  to  the  faulty  link  [McS82a]. 
However,  switching  element  faults  cannot  be  tolerated  by  adding  redundant 
straight  links. 

A  second,  more  effective  modification  to  gain  fault  tolerance  is  to  add  half 
links  to  each  of  stages  1  through  n-1.  Half  links  connect  a  switching  element 
m in stage i to switching elements $(m + 2^{i-1})$ modulo N and
$(m - 2^{i-1})$ modulo N. This is shown for N = 8 in Figure 5.9. Adding half links
provides  single-fault  tolerance  to  any  switching  element  or  link  failure.  This  is 


because  at  any  switching  element  (except  those  in  stage  n-1,  the  last  stage) 
along  a  route  from  a  network  input  to  output  there  are  at  least  two  (sometimes 
three)  links  leading  to  distinct  switching  elements  in  the  successive  stage,  any 
of  which  can  be  used  to  satisfy  the  overall  routing  need.  With  a  single-stage 
look-ahead  technique  [McS82a]  the  network  becomes  dynamically  two-fault 
tolerant.  That  is,  messages  will  not  be  sent  along  a  route  on  which  all 
alternative  paths  to  the  next  stage  are  blocked  by  the  two  faults.  Further 
modifications  of  the  two  hardware  enhancement  schemes  presented  are 
discussed in [McS82a].

Routing for the Enhanced IADM network with redundant straight links is
exactly the same as for the IADM network [McS82a], because no new paths
between inputs and outputs are provided by the additional links. A routing
tag, T, for the IADM network (with or without redundant straight links) is
computed as $T = t_n t_{n-1} \ldots t_1 t_0 = D - S$, where S is the source address, and D
the destination address. The tag T is expressed in signed magnitude notation,
where $t_n$ is the sign bit.

For the Enhanced IADM network with half links the routing tag scheme
uses a tag T computed as $T = D - S$ if $D \ge S$, else $T = 2^n - (S - D)$. Each
switching  element  is  assumed  to  have  the  ability  to  determine  if  the  switching 
elements  or  links  to  which  it  is  connected  are  faulty  and,  if  so,  to  modify 
routing  tags  dynamically  to  avoid  a  faulty  component.  Considerable  switching 
element  logic  must  be  devoted  to  interpreting  and  modifying  tags  as 
information  flows  through  the  network  if  the  full  fault-tolerance  capabilities  of 
the  network  are  to  be  achieved.  This  makes  the  switching  elements 
appropriate for VLSI implementation. Note that no burden is placed on
devices  using  the  network  in  achieving  these  fault-tolerance  capabilities. 
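
The half-link tag computation is a two-case formula and can be written out as a short sketch. The function name `half_link_tag` is hypothetical, and the case D = S is assigned to the first branch (an assumption) so that the tag always fits in n bits. The sketch covers only the computation of T at the source, not the dynamic tag modification performed inside the switching elements.

```python
def half_link_tag(source, dest, n):
    """Return T = D - S if D >= S, else 2**n - (S - D)."""
    if dest >= source:
        return dest - source
    return (1 << n) - (source - dest)


if __name__ == "__main__":
    # N = 8: routing from 6 to 1 gives 8 - 5 = 3, i.e. a net shift of +3 modulo 8.
    print(half_link_tag(6, 1, 3))
```

With this branch assignment the tag equals (D - S) modulo N, so it is always a non-negative n-bit value and no sign bit is needed.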


5.2.6  Gamma  Network 


The Gamma network [PaR82, PaR84] is an adaptation of the IADM
network (see Figure 5.8). It has redundant paths connecting $2^n = N$ inputs to
N outputs and consists of n stages of N switches. The Gamma network is
shown for N = 8 in Figure 5.10. However, unlike the IADM, each of the
switching elements is, in general, a 3-input/3-output crossbar switch instead of
a one-of-three inputs to one-of-three outputs selector. Switching elements in
the input stage have only one input and three outputs, while output stage
switches have three inputs and only one output. The link connection pattern is
identical to that of the IADM.

The  fault  tolerant  nature  of  the  Gamma  network  can  be  related  to  certain 
number  systems.  A  number  system  is  a  method  for  expressing  numerical 
values.  In  a  radix  r  number  system,  values  are  represented  by  digit  strings 
where  each  digit  can  have  any  of  the  r  values  {0,  1,  ...,  r  -  1}.  Number  systems 
in which digits have more than r possible values can be constructed. Let each
digit be in the set $\{-a, -(a-1), \ldots, -1, 0, 1, \ldots, a\}$. When $r \ge 2$ and $a = r - 1$ a
radix r fully redundant number system is formed in which each digit is in the
range $\{-(r-1), -(r-2), \ldots, -1, 0, 1, \ldots, r-1\}$. This number system is
redundant in the sense that some values will have more than one
representation. In fact, all non-zero values have multiple forms. A radix 2
(binary) fully redundant number system can use the digits 1, 0, and $\bar{1}$, where $\bar{1}$
corresponds to -1.
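
The redundancy of the binary fully redundant number system can be seen by enumerating digit strings. The sketch below is illustrative; the function name `redundant_representations` is hypothetical, and the enumeration is restricted to a fixed number of digits, which is an assumption made so that the search is finite.

```python
from itertools import product


def redundant_representations(value, n_digits):
    """List digit strings (t_{n-1}, ..., t_0), t_i in {1, 0, -1}, equal to value."""
    found = []
    for digits in product((1, 0, -1), repeat=n_digits):
        # digits[0] is the most significant digit, weight 2^(n_digits - 1).
        total = sum(d * 2 ** (n_digits - 1 - k) for k, d in enumerate(digits))
        if total == value:
            found.append(digits)
    return found


if __name__ == "__main__":
    for digits in redundant_representations(3, 3):
        print(digits)
```

With three digits the value 3 can be written as 011, as 1 0 -1 (4 - 1), and as 1 -1 1 (4 - 2 + 1), whereas 0 has only the all-zero form.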

Consider a source, S, a destination, D, and their difference
(D-S) modulo N. Each representation of (D-S) modulo N and
((D-S) modulo N) - N in the binary fully redundant number system
corresponds to a path in the Gamma network connecting S to D. Thus, as long
as $S \ne D$ there will be multiple paths. This is the source of Gamma network
fault tolerance. If S = D, only one path exists (as in the IADM). In general,
the number of paths is given by

$$
P_n(x) =
\begin{cases}
P_{n-1}\left(\frac{x}{2} \bmod 2^{n-1}\right), & x \text{ even} \\
P_{n-1}\left(\frac{x-1}{2} \bmod 2^{n-1}\right) + P_{n-1}\left(\frac{x+1}{2} \bmod 2^{n-1}\right), & x \text{ odd}
\end{cases}
$$

where x = (D-S) modulo N, the arguments of $P_{n-1}$ are reduced modulo
$2^{n-1} = N/2$, $P_1(0) = 1$, and $P_1(1) = 2$. Note that $P_n(0) = 1$
for all n and that $P_n(x) > 1$ for $x \ne 0$. Table 5.1 lists the number of paths as a
function of the value of (D-S) modulo N [PaR82] for N up to 16.
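
The recurrence can be evaluated and cross-checked by direct enumeration of routing tags. The sketch below is not taken from [PaR82]; the function names are hypothetical, and the arguments of $P_{n-1}$ are reduced modulo $2^{n-1}$. The brute-force count treats every n-digit string over {-1, 0, 1} whose weighted sum is congruent to x modulo N as one path.

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=None)
def gamma_paths(n, x):
    """Number of Gamma network paths for x = (D - S) modulo 2**n."""
    if n == 1:
        return 1 if x == 0 else 2          # P_1(0) = 1, P_1(1) = 2
    half = 1 << (n - 1)
    if x % 2 == 0:
        return gamma_paths(n - 1, (x // 2) % half)
    return (gamma_paths(n - 1, ((x - 1) // 2) % half)
            + gamma_paths(n - 1, ((x + 1) // 2) % half))


def gamma_paths_brute_force(n, x):
    """Count n-digit tags over {-1, 0, 1} whose value is congruent to x."""
    size = 1 << n
    return sum(
        1
        for tag in product((-1, 0, 1), repeat=n)
        if sum(t * (1 << i) for i, t in enumerate(tag)) % size == x
    )


if __name__ == "__main__":
    print([gamma_paths(3, x) for x in range(8)])   # path counts for N = 8
```

The counts over all x for a given n sum to $3^n$, since every one of the $3^n$ digit strings corresponds to exactly one difference modulo N.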

The  Gamma  network  can  be  controlled  by  n  digit  routing  tags,  the  value 
of  which  is  the  difference  modulo  N  between  the  numbers  of  the  network  input 
and  output  to  be  connected.  The  digits  of  the  tag  may  be  1,  0,  or  -1, 
corresponding to the $+2^i$, straight, and $-2^i$ links, respectively. Control of the
Gamma  network  when  faults  occur  is  not  explicitly  specified. 
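
How a tag steers a message can be shown by following its digits stage by stage. In this sketch the function name `follow_gamma_tag` is hypothetical, and the tag is supplied least significant digit first so that `tag[i]` controls stage i; fault handling is not modeled, consistent with the remark above.

```python
def follow_gamma_tag(source, tag, n):
    """Return the elements visited when tag[i] in {1, 0, -1} is used at stage i."""
    size = 1 << n
    position = source
    visited = [position]
    for i, digit in enumerate(tag):
        position = (position + digit * (1 << i)) % size   # +2^i, straight, -2^i
        visited.append(position)
    return visited


if __name__ == "__main__":
    # N = 8, source 1, destination 4: three tags that all represent D - S = 3.
    for tag in [(1, 1, 0), (-1, 0, 1), (1, -1, 1)]:
        print(tag, follow_gamma_tag(1, tag, 3))
```

The three tags reach the same output through different interior switching elements, which is the redundancy the fault-tolerance argument relies on.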

The Gamma network can perform all permutations passable by the IADM
network and some that it cannot. For example, the IADM cannot perform all
permutations  for  N  =  8  or  the  perfect  shuffle  for  all  N>8  [SSM80,  AdS82b], 
but  the  Gamma  network  can  perform  these  permutations. 

A  fault  model  that  can  be  used  for  the  Gamma  network  assumes  the 
following. 


1.  Only  switching  elements  fail. 


2.  The  input  and  output  stage  switching  elements  are  always  fault-free. 

3.  Faults  occur  independently. 

4.  Faulty switching elements are unusable.

The  fault-tolerance  criterion  appropriate  for  the  Gamma  network  is  full  access 
with  the  exception  that  an  input  need  not  be  able  to  connect  to  the  identically 
numbered  output.  Under  this  fault-tolerance  model  the  network  is  single  fault 
tolerant. For a computer system with a PE-to-PE structure, not requiring the
ability  to  connect  an  input  to  an  identically  numbered  output  (an  identity 
connection)  is  reasonable  since  a  PE  should  not  need  to  communicate  with 
itself.  For  the  P-to-M  model  on  the  other  hand,  identity  connections  are  likely 
to  be  important. 

5.2.7 Fault-Tolerant Benes Network

The Fault-Tolerant Benes Network is derived from the Benes network
[Ben65]. A Benes network connects $N = 2^n$ inputs to N outputs via 2n - 1
stages each with N/2 2-input/2-output switching elements and is a particular
instance of the more general Clos network [Clo53]. The switching elements can
be set to one of two states: straight or exchange. Figure 5.11 shows the Benes
network for N = 8. The Benes network is a rearrangeable network in that any
idle input/output pair can be connected by rerouting any established
one-to-one connections as necessary. In other words, any one-to-one connection can
be  established  regardless  of  any  existing  one-to-one  connections.  Thus,  the 
Benes  network  can  perform  any  permutation  of  inputs  to  outputs. 

The fault model used in [SoR80] for the analysis of fault tolerance of the
Benes network is a switching element stuck-fault model. That is, a switching
element can be stuck in the straight setting, or stuck in the exchange setting.
Specifically, 

1.  Only  switching  elements  fail,  and  they  fail  by  becoming  stuck  in  one  of 
their  two  states. 

2.  Faults  occur  independently. 

3.  Faulty switching elements are usable.

This  is  a  relatively  weak  fault  model,  in  that  it  supposes  an  optimistic  view  of 
hardware  behavior.  For  example,  other  switching  element  failure  modes  may 
well  be  possible,  such  as  ones  where  continued  use  of  the  switching  element  is 
not  possible.  Link  failures  may  also  occur  in  a  physical  network. 

The  Benes  network  fault-tolerance  criterion  is  retaining  the  ability  to 
perform  any  permutation  connection  in  a  single  pass  through  the  network, 
known as full connection capability. It is the most stringent fault-tolerance
criterion of all the networks surveyed, but the Benes network is the most
capable of these networks, in terms of permuting capability. The Benes
network can tolerate most single faults, as defined by this fault-tolerance
model.  Some  multiple  switching  element  faults  not  in  the  center  stage  can  be 
tolerated  as  well,  so  the  network  is  robust.  However,  if  any  single  switching 
element  in  the  center  stage  is  stuck  at  the  exchange  setting  then  the  identity 
permutation,  which  connects  each  input  to  the  identically  numbered  output, 
cannot  be  performed.  Also,  if  any  center  stage  switching  element  is  stuck  at 
the straight setting then the uniform shift connecting each input i, $0 \le i < N$,
to output (i + N/2) modulo N is one permutation no longer possible.




Any  center  stage  fault  is  correctable  by  adding  a  single  switching  element 
at the input or output stage [SoR80]. The configuration of the fault-tolerant
network  with  the  extra  switching  element  at  the  output  is  shown  in 
Figure  5.12  for  N=8.  Tolerance  of  a  fault  is  achieved  by  using  the  extra 
switching  element  to  correct  for  the  misrouting  (if  any)  caused  by  the  fault. 
Further  modifications  of  the  Benes  network  allowing  multiple-fault  tolerance  to 
switching  element  stuck-at  faults,  but  requiring  extra  stages  of  switches,  are 
described in [SoR80].

Fault-Tolerant Benes network routing is performed by computing the
necessary settings of all the switching elements, and then imposing that state
on the network through control lines, one per switch. The algorithm for the
Benes network to compute control information for any permutation requires
O(N log N) time for execution [OpT71] and global knowledge of the control
state of the network, i.e., centralized control. The algorithm does not avoid
faulty switches; required switch settings can be adjusted to match the state of
a stuck switch. Faulty switches must be used if permutations are to be
performed in only one pass through the network.

A new algorithm for controlling the Benes network was presented in
[Lee84]. Unlike the algorithm in [OpT71], it is not recursive and requires only
knowledge of the control state of the previous stage, rather than the entire
network. However, it too has O(N log N) time complexity, and also implies
centralized network control.

In the Fault-Tolerant Benes network, routing in the case of a fault not in
a center stage switching element is performed as in the fault-free network,
except that the reducible connection set assignment, which determines
subnetwork settings, is changed, if needed, so that the subnetwork with the
stuck switch is assigned the reducible connection set that requires the faulty
switch to be set as it is stuck [SoR80]. For permutations affected by a fault in
the center stage, routing is accomplished by first computing the control signals
that would be used in the Benes network if outputs 0 and 4 of the permutation
were interchanged. The outputs 0 and 4 are then exchanged by the additional
switching element.
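
The center-stage correction just described can be illustrated with a minimal
sketch for N = 8. The function name interchange_outputs and the sample
permutation are hypothetical and are not taken from [SoR80]; the sketch only
shows that routing a permutation with outputs 0 and 4 interchanged, and then
exchanging those two outputs in the extra switching element, delivers the
permutation originally requested. Computing the Benes control signals
themselves is not attempted here.

```python
def interchange_outputs(perm, a=0, b=4):
    """Return a copy of perm with destination outputs a and b swapped.

    perm[i] is the network output requested by input i.
    """
    swap = {a: b, b: a}
    return [swap.get(out, out) for out in perm]


# Hypothetical permutation requested on an N = 8 network.
desired = [3, 0, 6, 4, 1, 7, 2, 5]

# Step 1: control signals are computed for this modified permutation.
routed = interchange_outputs(desired)

# Step 2: the extra switching element exchanges outputs 0 and 4 again,
# so every input reaches the output it originally requested.
delivered = interchange_outputs(routed)
```

Because the interchange is its own inverse, the second application undoes the
first, which is why a single extra two-state element suffices in this sketch.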

In general, the combination of centralized control and O(N log N) time
makes both the Benes and the Fault-Tolerant Benes networks unsuitable for a
parallel/distributed system in which the network is often reconfigured.

5.2.8 Augmented C-Network

The Augmented C-network (ACN) is derived from the C-network [ReK84].
The C-network connects N inputs to N outputs via m stages, m > 0, of 2 x 2
switching elements, each stage with N/2 switches. Stages are numbered 0 to
m-1 from input to output. C-networks are a broad family of interconnection
networks. The Baseline, Omega, Generalized Cube, and Benes networks are all
instances of C-networks.

Let one output of a switch be labeled 0 and the other, 1. For a switch S
in stage i, i ≠ m-1, its 0-successor, denoted succ^0(S), is the switch in stage
i + 1 connected to its 0 output. The 1-successor, succ^1(S), is the stage i + 1
switch connected to its 1 output.

The topology of a C-network is defined by the following relationship. For
each switch S_j in stage i, 0 ≤ i < m-1, 0 ≤ j < N/2, there exists a switch S_k,
0 ≤ k < N/2 and k ≠ j, in stage i such that succ^0(S_j) = succ^0(S_k) and
succ^1(S_j) = succ^1(S_k). Switches S_j and S_k are said to be conjugate; this is
denoted as conj(S_j) = S_k and conj(S_k) = S_j.

The notion of conjugate switches was discussed in [Agr83] using the term
output buddies. The concept of input buddies, the 0- and 1-successors of a
switch and its conjugate, was also noted. A network in which two pairs of
input buddies also constitute two pairs of output buddies has the strict buddy
property. C-networks do not necessarily have this property.
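
The conjugate relationship can be sketched in a few lines of code. The
successor tables below are a hypothetical four-switch stage (they resemble the
first stage of a Baseline-style network for N = 8, but that is an assumption
made only for illustration), and the function name conjugates is likewise
invented here. Two switches are paired when they share both their 0-successor
and their 1-successor.

```python
def conjugates(succ0, succ1):
    """Pair up the switches of one stage that share both successors.

    succ0[s] and succ1[s] give the 0-successor and 1-successor of switch s.
    Returns a dict mapping every switch to its conjugate.
    """
    by_successors = {}
    for switch in succ0:
        key = (succ0[switch], succ1[switch])
        by_successors.setdefault(key, []).append(switch)

    conj = {}
    for group in by_successors.values():
        if len(group) != 2:
            raise ValueError("stage does not satisfy the C-network relationship")
        a, b = group
        conj[a], conj[b] = b, a
    return conj


# Hypothetical stage with N/2 = 4 switches.
succ0 = {0: 0, 1: 0, 2: 1, 3: 1}
succ1 = {0: 2, 1: 2, 2: 3, 3: 3}
stage_conj = conjugates(succ0, succ1)
```

In this example switches 0 and 1 are conjugate, as are switches 2 and 3. A
stage in which some switch has no partner raises an error, since such a stage
would not meet the defining relationship.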

A C-network with the additional property that succ^0(S) ≠ conj(succ^1(S))
and succ^1(S) ≠ conj(succ^0(S)) is the basis for the ACN. So that all networks
in this survey can be compared on an equal basis, the C-network will be assumed
to have n = log2 N stages. Beginning with such a C-network, an ACN is
constructed by replacing all the 2 x 2 switching elements with 4 x 4 crossbar
switches with inputs labeled 0, 1, conj(0), and conj(1). Switch outputs are
labeled similarly (see Figure 5.13). Those switch inputs and outputs labeled 0
and 1 are connected exactly as in the base C-network. The conj(0) and conj(1)
ports are connected as follows. In stage 0, switch S inputs conj(0) and conj(1)
are connected to the sources connected to the 0 and 1 inputs of switch conj(S),
respectively. This is shown in Figure 5.14(a). In stage n-1, switch S outputs
conj(0) and conj(1) are connected to the outputs connected to the 0- and
1-outputs of switch conj(S), respectively (see Figure 5.14(b)). The conj(0) and
conj(1) outputs of switches S_j and conj(S_j) in stage i, 0 ≤ i < n-1, are
connected to the conj(0) and conj(1) inputs, respectively, of conj(succ^0(S_j))
and conj(succ^1(S_j)), where succ^0(S_j) = S_p and succ^1(S_j) = S_q (see
Figure 5.14(c)).

The ACN fault model implied in [ReK84] is the following.


1. Both switching elements and links fail.

2. Faulty switching elements are unusable.

Figure 5.14 (a) Connections to stage 0 switches in an ACN. (b) Connections
from stage n-1 in an ACN. (c) Connections from a conjugate
switch pair in an ACN.

The ACN fault-tolerance criterion is retention of full access capability. The
ACN provides 2^n distinct paths between any source and destination; however,
most of these paths are not disjoint. In general, there are at least two switches
at any stage and two links between stages by which a given source and
destination can be joined. Thus, the ACN is single-fault tolerant to both
switch and link failures.

Routing  in  the  ACN  is  predicated  on  a  routing  tag  scheme  existing  for  the 
base  C-network.  It  is  assumed  that  a  switch  can  determine  when  a  successor 
switch  is  faulty.  In  the  case  of  no  faults,  the  C-network  routing  tag  is 
determined  and  interpreted  as  for  the  base  C-network;  the  standard  path  is 
taken  through  the  ACN. 

In case there are unavailable switches (either due to failure or previously
established data transfer) two routing strategies are proposed [ReK84]. Each
strategy begins with the routing tag for the base C-network. The first strategy
is defined by the flow chart shown in Figure 5.15. The second strategy is
similar to the first, and is shown in Figure 5.16. Either routing scheme takes
full advantage of the fault tolerance characteristics of the ACN.

Figure 5.15 Flow chart for one ACN routing strategy.

5.2.9 β-Networks

A β-network is formed by interconnecting a set of β-elements [ShH84]. A
β-element is a 2-input/2-output switch that can perform the two permutations
straight and exchange. That is, a β-element is a two-state interchange box
[SiS78, Sie79a]. Thus, many networks, including the Fault-Tolerant Benes and
Modified Baseline, are β-networks. An example of a simple β-network is shown
in Figure 5.17. Although they may duplicate the topologies of other surveyed
networks, β-networks are included for their fault model and fault-tolerance
criterion, as these differ substantially from those of the other networks
considered in this chapter. This provides a more complete view of the state of
the art in fault-tolerant multistage networks. In the sense that β-networks can
be designed to enhance their ability to tolerate faults, they are included in this
survey of networks so designed.

A β-network is defined as having the dynamic full-access property if any
given network input can be connected to any single network output in a finite
number of passes through the network. Between passes it is assumed that each
output can connect to its corresponding input (i.e., the input with the same
number as the output) via a path outside the network. The β-network is said
to tolerate a fault if the fault does not destroy dynamic full-access capability.
This is a considerably less restrictive fault-tolerance criterion than those of the
other networks surveyed. The purpose in using the dynamic full-access
measure is to better characterize the connectivity requirements of computer
systems than either full-access or rearrangeability (full connection) capability
[ShH80]. However, the multiple pass method of network operation implied by
the dynamic full-access criterion may be unsuited for some, if not many,
applications.

The fault model used for β-networks includes two failure modes in the
β-element only. These failures are stuck at straight and stuck at exchange. The
β-elements are assumed to remain able to pass data despite faults. This is the
same failure mode assumed for switching elements in the Fault-Tolerant Benes
network.


A graph model is used to study faults and fault tolerance in β-networks.
A β-graph is a directed graph with vertices representing the β-elements, and
edges representing the links in the β-network. There is an edge from vertex i
to vertex j of the β-graph if an output terminal of β-element i connects to an
input terminal of β-element j. Figure 5.18 shows the β-graph of the example
β-network in Figure 5.17. For multistage β-networks, edges in the β-graph can
denote network links or devices using the network. The edges representing
devices using the network are called computer edges.

A critical fault is a collection of stuck β-elements which destroys the
dynamic full-access property of the β-network. A faulty β-element is stuck at
either straight or exchange, so in the β-graph it can be represented as a split
vertex. A strongly connected directed graph is one such that between each pair
of vertices a and b there exists a path from a to b as well as from b to a
[JoJ72]. Thus, the dynamic full-access property is lost by a fault which
disconnects the originally strongly connected β-graph. A network design that
retains dynamic full access for the maximum number of faults is discussed in
[AgL84].
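
The strong-connectivity test for a critical fault can be sketched as follows.
Everything in the sketch is an assumption made for illustration: the
two-element network LINKS is hypothetical (it is not the network of
Figure 5.17), the function names are invented, and the graph is built over
links rather than over vertices so that a stuck element simply restricts which
outgoing link each incoming link may continue on. This plays the role of the
split vertex described above.

```python
from collections import deque

# Hypothetical network of two elements. Each link runs from an output port of
# one element to an input port of another:
# (source element, source port, destination element, destination port).
LINKS = [
    (0, 0, 1, 0),
    (0, 1, 1, 1),
    (1, 0, 0, 0),
    (1, 1, 0, 1),
]


def allowed_outputs(element, in_port, stuck):
    """Output ports reachable from in_port, given the stuck-at settings."""
    state = stuck.get(element)
    if state == "straight":
        return [in_port]
    if state == "exchange":
        return [1 - in_port]
    return [0, 1]  # fault-free element: either setting can be chosen


def link_graph(links, stuck):
    """Directed graph whose nodes are link indices."""
    leaving = {(src, port): idx for idx, (src, port, _, _) in enumerate(links)}
    graph = {idx: [] for idx in range(len(links))}
    for idx, (_, _, dst, dst_port) in enumerate(links):
        for out_port in allowed_outputs(dst, dst_port, stuck):
            nxt = leaving.get((dst, out_port))
            if nxt is not None:
                graph[idx].append(nxt)
    return graph


def reachable(graph, start):
    """Set of nodes reachable from start by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def strongly_connected(graph):
    """True if every node reaches every other node."""
    nodes = list(graph)
    if not nodes:
        return True
    reverse = {node: [] for node in nodes}
    for node, successors in graph.items():
        for nxt in successors:
            reverse[nxt].append(node)
    start = nodes[0]
    return (len(reachable(graph, start)) == len(nodes)
            and len(reachable(reverse, start)) == len(nodes))


def is_critical(links, stuck):
    """A fault is critical if it destroys strong connectivity."""
    return not strongly_connected(link_graph(links, stuck))
```

For this small example a single element stuck at straight is tolerated, while
both elements stuck at straight form a critical fault, since the links then
fall into two separate two-link loops.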

A minimal critical fault is a critical fault for which no proper subset
constitutes a critical fault. Minimal critical faults can be characterized in two
ways. One is by circuit partitions, a collection of edge-disjoint elementary
circuits of a β-graph such that each vertex belongs to exactly two elementary
circuits and each edge belongs to exactly one elementary circuit. Figure 5.19
illustrates two circuit partitions of the example β-network.

For every circuit partition there is a circuit adjacency graph which is an
undirected graph whose vertices represent the elementary circuits of the
partition, and whose edges represent the vertices of the β-graph. Figure 5.20
shows the two circuit adjacency graphs for the example β-network circuit
partitions.

A cutset is a minimal set of edges in a graph with p separate parts, the
removal of which results in a graph with p + 1 parts [JoJ72]. Every cutset of a
circuit adjacency graph represents a minimal critical fault. Also, every minimal
critical fault can be represented by a cutset of a circuit adjacency graph. That
is, a one-to-one correspondence between circuit adjacency graphs and minimal
critical faults exists.

A second way to characterize minimal critical faults is with Eulerian
circuits, which are circuits that use every edge in a directed graph once [JoJ72].
Figure 5.21 gives the two Eulerian circuits of the example β-graph. A
β-network fault is critical if it is not compatible with any Eulerian circuit of
the β-network. Two network settings are compatible if all specified β-element
settings (straight or exchange) match. Unspecified β-element settings (“don’t
care” settings) always match. A fault can be considered a partial setting where
the faulty β-elements are specified and the remainder are “don’t care.”
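
The compatibility rule lends itself to a short sketch. Settings are modeled
here as dictionaries from element number to "straight" or "exchange", with a
missing key standing for a "don't care" entry; this representation, the
function names, and the two sample Eulerian settings are all assumptions made
for illustration and do not come from Figure 5.21.

```python
def compatible(setting_a, setting_b):
    """True if every element specified in both settings has the same state."""
    return all(setting_b.get(element, state) == state
               for element, state in setting_a.items())


def fault_is_critical(fault, eulerian_settings):
    """A fault is critical if no Eulerian-circuit setting is compatible with it."""
    return not any(compatible(fault, setting) for setting in eulerian_settings)


# Hypothetical settings corresponding to the Eulerian circuits of a
# two-element network.
EULERIAN_SETTINGS = [
    {0: "straight", 1: "exchange"},
    {0: "exchange", 1: "straight"},
]

single_fault = {0: "straight"}                  # element 1 is "don't care"
double_fault = {0: "straight", 1: "straight"}

single_is_critical = fault_is_critical(single_fault, EULERIAN_SETTINGS)
double_is_critical = fault_is_critical(double_fault, EULERIAN_SETTINGS)
```

With these sample settings the single fault matches the first Eulerian setting
and is therefore not critical, whereas the double fault matches neither.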

There are two important disadvantages to the β-network approach to
fault-tolerant networks. One is the computational complexity of using the
critical fault characterizations. Even when faults have been detected and
located, considerable work remains to determine the operational status of the
network. Specifically, the set of located faults must be tested to see if it
comprises a critical fault. This can be done by checking to see if any subset of
the faults constitutes a minimal critical fault. A specific testing technique is
presented in [AgL84]. A list of minimal critical faults can be precomputed for
a given β-network structure. The second disadvantage is that by allowing a
finite number of passes through the network, data transit time becomes
variable. This will impose burdens on an SIMD system attempting to maintain
synchronization.
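
The subset check against a precomputed list of minimal critical faults can be
sketched as below. This is not the testing technique of [AgL84]; it is only a
direct reading of the subset test, with faults modeled as dictionaries from
element number to stuck-at state. The list MINIMAL_CRITICAL_FAULTS and the
function name are hypothetical.

```python
def comprises_critical_fault(located_faults, minimal_critical_faults):
    """True if some precomputed minimal critical fault is contained in the
    set of located faults."""
    located = set(located_faults.items())
    return any(set(mcf.items()) <= located for mcf in minimal_critical_faults)


# Hypothetical precomputed list for a small network.
MINIMAL_CRITICAL_FAULTS = [
    {0: "straight", 1: "straight"},
    {0: "exchange", 1: "exchange"},
]

network_down = comprises_critical_fault(
    {0: "straight", 1: "straight", 2: "exchange"}, MINIMAL_CRITICAL_FAULTS)
network_up = comprises_critical_fault(
    {0: "straight", 2: "exchange"}, MINIMAL_CRITICAL_FAULTS)
```

Each query costs time proportional to the length of the precomputed list,
which can itself be large; that is one way to view the computational burden
noted above.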

Routing in a β-network can be accomplished using binary routing tags
with as many bit positions as there are stages in the network. However,
β-networks constitute such a broad class that no one routing tag scheme is
generally applicable. Also, realization of dynamic full-access capability may
incur significant computational expense for routing tags, since a set of tags
leading from the original source via a finite number of passes through the
network to the ultimate destination must be generated. No scheme for
generating such a set of tags is given in [ShH80].


5.3  Comparison 

Table 5.2 summarizes the network fault tolerance information presented
in Section 5.2. It gives the possible faults that can occur in each network
under the assumed fault model, whether or not faulty components are usable,
the fault-tolerance criterion, the method by which the network copes with
faults, whether the network is single-fault tolerant, and how it performs when
there are multiple faults. Note that the phrase “internal node faults only” used
in the table is another way of saying input and output switching elements are
always fault-free.

There is a growing literature on fault-tolerant multistage interconnection
networks, as is demonstrated by the survey of Section 5.2. However, as pointed
out in [LLY82], the results have several limitations, including:

1. unreasonably optimistic fault-tolerance models, and

2. increased data routing complexity.

Table 5.2 Summary of fault tolerance information for the networks surveyed.
“SE” is an abbreviation for switching element.

Network               Fault Model           Fault-Tolerance         Fault-Tolerance   Single-Fault   Multiple-Fault
                                            Criterion               Method            Tolerant*      Tolerance*

Modified Baseline     internal SE only;     full access             alternate route   yes            robust
                      unusable

Augmented Delta       internal SE or link;  full access             alternate route   yes            robust; (b-1)-fault
                      unusable                                                                       tolerant with b x b
                                                                                                     switches

Multipath Omega       internal SE or link;  full access             alternate route   yes            robust
                      unusable

F-network             internal SE only;     full access             alternate route   yes            robust
                      unusable

Enhanced IADM         internal SE or link;  full access             alternate route   no             robust
(straight links)      unusable

Enhanced IADM         internal SE or link;  full access             alternate route   yes            robust; 2-fault
(half links)          unusable                                                                       tolerant with
                                                                                                     lookahead

Gamma                 internal SE only;     full access, but no     alternate route   yes            robust
                      unusable              identity connection

Fault-Tolerant Benes  SE stuck, but usable  full connection         correct misroute  yes            robust
                                            capability

Augmented C-network   any SE or link;       full access             alternate route   yes            robust
                      unusable

β-networks            SE stuck, but usable  dynamic full access     repeated pass     depends on     typically robust
                                                                                      network

* Answer depends critically on fault model and fault-tolerance criterion.

To place the ESC in perspective, it is compared with the fault-tolerant
multistage interconnection networks surveyed in Section 5.2. Table 5.3
summarizes that comparison. Note that Table 5.3 gives qualitative
information; for example, in the column on fault model the phrase “slightly less
strict” generally means input and output stage switching elements are assumed
fault-free, while the phrase “less strict” typically includes the previous
restriction and adds the assumption of fault-free links. For specific details see
the appropriate subsection of Section 5.2. The facts and reasoning supporting
Table 5.3 are discussed below.

As noted in Chapter 4, the choice of fault model and fault-tolerance
criterion plays a key role in determining the fault-tolerance characteristics of a
network. ESC fault tolerance is evaluated in light of a fault model that
presupposes the possibility of failure of any network component except the
stage n demultiplexers and stage 0 multiplexers. (Stage n multiplexers and
stage 0 demultiplexers are treated as stage n and stage 1 link faults,
respectively.) As can be seen from Table 5.3, this fault model is stricter than
all but one of the fault models of the comparison networks. The ESC fault
model assumes at least as many possibilities for failure within a network (both
switching elements and links) and dire consequences for such failures (any
faulty component is unusable). It may well be the most realistic of those fault
models.

The fault-tolerance criterion for the ESC is the same as that for most of
the networks surveyed. Basically, what is required is that one-to-one
interconnection capability be uncompromised.

Table 5.3 Comparison of surveyed networks with the ESC. Entries give the
relationship between the network in question and the ESC as
regards a particular attribute. “SE” is an abbreviation for
switching element.

Network               Fault Model           Fault-Tolerance   Routing                   Hardware          Fault-Tolerance
                                            Criterion         Complexity                Complexity        Capability*

Modified Baseline     less strict           same              similar                   slightly less     similar

Augmented Delta       slightly less strict  same              similar                   slightly less     similar

Multipath Omega       slightly less strict  same              similar to slightly       slightly less     similar
                                                              greater

F-network             less strict           same              similar                   slightly greater  similar

Enhanced IADM         slightly less strict  same              similar                   greater           less
(straight links)

Enhanced IADM         slightly less strict  same              similar; complexity       greater           greater if complex
(half links)                                                  hidden in SE                                routing used

Gamma                 less strict           same              similar                   greater           similar

Fault-Tolerant Benes  much less strict      stricter          much higher               much greater      similar

Augmented C-network   as strict             same              similar; complexity       much greater      similar
                                                              hidden in SE

β-networks            much less strict      much less strict  much higher               less to greater†  similar to greater†

* Using the fault model and fault-tolerance criterion defined for that network.
† Depends on specific instance of network structure.

For most of the networks, routing in the presence of faults is little more
complex than in the absence of faults. The notable exception to this is
β-networks. Given an input and output to be connected, the dynamic full
access procedure requires choosing a set of intermediate outputs which can each
be reached consecutively. A general solution to this problem is not known.
Routing complexity for the Fault-Tolerant Benes network is higher than for
the ESC because of the nature of the Benes network [OpT71]; it is not due to
the modification for fault tolerance.

Hardware complexity of the networks varies significantly. The measure
underlying the presentation in Table 5.3 is a mixture of the asymptotic
number of network components needed and the complexity of the switching
element. The intent is to get a rough estimate of package count at various
levels, specifically, the chip and board level or multiple-chip carrier level (e.g.,
the IBM Thermal Conduction Module [BlB82]). The obvious use of such
information is to assess implementation cost. However, hardware complexity
and implementation cost may bear little relationship; knowledge of
implementation details arising from a hardware design study is necessary for
high-confidence estimates of cost. VLSI technology often allows a large
reduction in package count, for example. With this caveat, the hardware
complexity data can be used for comparison purposes.

The fault-tolerance capabilities of the networks are all reasonably similar
given the chosen standard by which each network's features are determined.
This is apparent in the column on fault-tolerance capabilities in Table 5.3.
There should be no surprise that this is so. It is easy to agree with the idea
that it is desirable for a network to have whatever fault-tolerance capabilities
are feasible. Single-fault tolerance is more feasible than i-fault tolerance, i > 1.

However, because each network is studied using its own fault model and
fault-tolerance criterion, significant differences in capabilities might appear if a
common fault model and fault-tolerance criterion are adopted.

The ESC fault model and fault-tolerance criterion can be applied to the
networks surveyed in the previous section in order to relate their fault
tolerance to that of the ESC network. This information is given in the first
column of Table 5.4. Under the ESC fault model and fault-tolerance criterion
only the ESC and ACN are single-fault tolerant. Many of the networks fail to
be single-fault tolerant because they cannot tolerate an input or output
switching element fault, as can the ESC and ACN. This is why so many of the
fault models refer only to internal switching element faults. If the ESC fault
model is amended to assume fault-free switching elements in the input and
output stages, some of the networks become single-fault tolerant, as shown in
the table. Alternatively, those same networks that are single-fault tolerant
under the relaxed fault model could be fitted with input and output stage
bypass circuitry and become single-fault tolerant without weakening the fault
model.

Table 5.4 Fault tolerance capabilities of the networks using the ESC network
fault model and fault-tolerance criterion.

A. Single-fault tolerant using ESC fault model and fault-tolerance
   criterion.

B. Single-fault tolerant if ESC fault model is relaxed to assume
   input and output stage switching elements are fault-free.

Network                          A     B

Extra Stage Cube                 yes   yes
Modified Baseline                no    yes
Augmented Delta                  no    yes
Multipath Omega                  no    *
F-network                        no    yes
Enhanced IADM (straight links)   no    no
Enhanced IADM (half links)       no    yes
Gamma                            no    no
Fault-Tolerant Benes             no    yes†
Augmented C-network              yes   yes
β-networks                       no    yes*

* Typically yes, but depends on specific instance of network structure.
† Must modify control scheme to achieve fault tolerance under this model.


The Gamma network is not single-fault tolerant under the relaxed model
because it has only one path from an input to the identically numbered output.
The ESC fault-tolerance criterion is full access, so a straight link fault will
prevent an input from communicating with the identically numbered output
(as it would in the IADM network on which the Gamma network is based).
Thus, the Gamma network does not satisfy the ESC fault-tolerance criterion of
full access.

The Fault-Tolerant Benes network is capable of single-fault tolerant
operation under the relaxed ESC fault model. Although faulty components
cannot be used to pass data under the ESC fault model (unlike the
Fault-Tolerant Benes fault model), only one-to-one connections need be
supported (as compared to permutation connections for the Fault-Tolerant
Benes fault model). The Fault-Tolerant Benes network can perform any
one-to-one connection without using a given faulty component. However, the
control method given in [SoR80] must be modified to achieve this fault
tolerance capability so that faulty network components are avoided (the given
algorithm uses faulty components).

The ACN is fault tolerant under the ESC fault model and fault-tolerance
criterion. However, this capability comes at the price of significantly
increased hardware complexity. ACN 4 x 4 crossbar switching elements are
more complex than the interchange boxes used in the ESC and must have
memory in addition so that they can support the routing scheme. Further,
the ACN has twice as many links as the ESC. For even small network sizes,
this gives the ESC a significant cost advantage over the ACN.

A β-network is single-fault tolerant given the relaxed ESC fault model if
it provides at least two paths that share no network components other than
their first and last β-elements between any source and any destination. This
restriction must hold for all passes through the network that may be required
for the two paths.


5.4  Conclusions 

Nine  fault-tolerant  multistage  interconnection  networks  have  been 
described.  The  particular  characteristics  considered  included  fault  model, 
fault-tolerance  criterion,  fault-tolerance  method,  single-  and  multiple-fault 
tolerance,  routing  complexity,  and  hardware  complexity.  With  this 
background,  the  networks  were  compared  with  the  ESC. 

There are wide variations in topology and switching element design among
the networks surveyed. The Augmented Delta, Modified Baseline, and
Multipath Omega networks are cube-type networks. The Augmented Delta
and Modified Baseline networks each incorporate an extra stage of switching
elements to provide redundant paths. The Multipath Omega may use an extra
stage or stages of switching elements, or substitute a stage or stages of different
switches, or use some combination of these methods. The Gamma network uses
3 x 3 crossbar switching elements and the same link connection pattern as the
IADM. The Enhanced IADM and F-network use 5 x 5 and 4 x 4 switches,
respectively, which pass one item at a time. The Augmented C-network uses
4 x 4 crossbar switching elements with a general family of topologies. The
Fault-Tolerant Benes network and β-networks are both composed of
β-elements, but have different link patterns.

The ESC network provides single-fault tolerance despite its challenging
fault model and fault-tolerance criterion. The fault model and fault-tolerance
criterion were chosen for their consistency with engineering pragmatism. It is

CHAPTER 6

RELIABILITY  OF  THE  EXTRA  STAGE  CUBE 
INTERCONNECTION  NETWORK 


6.1  Introduction 

The term reliability has many different technical meanings; one that is
commonly accepted is that the reliability of a component or system at time t,
denoted R(t), is given by R(t) = P(T>t), where T is the time to failure and
P(T>t) is the probability that T>t [Mey70]. Time to failure can be
considered a continuous random variable of some probability density function.
The failure behavior of components can be studied and a probability density
function chosen to represent that behavior as accurately as possible.
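
As a small numerical illustration of R(t) = P(T>t), the sketch below computes
reliability as one minus the integral of a probability density function from 0
to t. The exponential density and the failure rate used are assumptions chosen
only because the result has a simple closed form, exp(-rate * t), against which
the numerical value can be checked; the text does not commit to any particular
density. The function names are hypothetical.

```python
import math


def reliability_from_pdf(pdf, t, steps=10000):
    """R(t) = P(T > t) = 1 - integral of pdf over [0, t] (trapezoidal rule)."""
    if t <= 0:
        return 1.0
    width = t / steps
    area = 0.5 * (pdf(0.0) + pdf(t))
    for k in range(1, steps):
        area += pdf(k * width)
    return 1.0 - area * width


def exponential_pdf(rate):
    """Density of an exponentially distributed time to failure (assumed)."""
    return lambda x: rate * math.exp(-rate * x)


RATE = 0.001  # assumed failure rate, failures per hour
numeric = reliability_from_pdf(exponential_pdf(RATE), t=100.0)
closed_form = math.exp(-RATE * 100.0)
```

Any other density fitted to observed failure behavior could be passed to
reliability_from_pdf in the same way.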

An analogous reliability measure for interconnection networks, termed
terminal-pair reliability [Hal’Slj, has been applied to the Gamma network
[PaR82, RaP84]. Terminal-pair reliability is the probability that there exists
at least one path between any network input and output. This reliability
measure is appropriate for the ESC as well as the Gamma network. For the
ESC it can be calculated as one minus the probability of loss of fault-free
interconnection capability. In the case of one fault, the ESC was shown in
Chapter 4 to have a zero probability of such loss of capability, hence, a
reliability of one. The goal here is to assess multiple-fault tolerance of the ESC
!  Xd^'slC 
6.2 Fault Model

The fault model chosen for the ESC provides a strong challenge to fault
tolerance analysis because it supposes both boxes and links fail, and fail with
different likelihoods. The complex nature of a physical network is thus more
accurately captured, but incorporating this complexity in an analysis is
correspondingly more involved.

As with any fault model, the accuracy with which reliability predictions
match measured reliability is the touchstone for the validity of the model. The
assumptions made for the model used here are more stringent with respect to
the analysis to be presented than those typically used for similar problems.
The possibility of link failure is often ignored in the literature on fault-tolerant
multistage interconnection networks [Agr82, PaR82, ShH80, SoR80, WFL82],
fault tolerance analysis [She82], and network reliability [CiS82].


6.3 Study of Multiple-Fault Tolerance

Given that stage n and 0 interchange boxes are bypassed when faulty, an
analysis of the probability of loss of fault-free interconnection capability of the
ESC can be naturally structured in four cases based on classes of physical fault
groupings. The four cases are closely related to the five fault categories listed
earlier. The first case is derived jointly from the first and second categories of
that list; the remaining three cases correspond to the third, fourth, and fifth
categories, respectively.

Let f be the total number of box and/or link faults present in an ESC
network. For the first case, all f faults are stage n box faults or all f faults
are stage 0 box faults. Fault-free interconnection capability is retained in this
situation, although only one path remains between any source


and destination because the stage with all of the faults will be disabled. In the
second case at least one, but not all, of the f faults are stage n box faults.
Fault-free interconnection capability is lost in this circumstance because stage
n is disabled, ensuring that the additional fault(s) must affect the single path
between some source and destination. An analogous situation exists if at least
one, but not all, of the f faults are stage 0 box faults. Finally, if all f faults are
in network components other than stage n or 0 boxes, then Theorem 4.() can be
used to determine whether fault-free interconnection capability is available.

The union of the sets of patterns of faults associated with these four cases
is the set of all possible occurrences of faults in the ESC. Figure 6.1 illustrates
the relationships among the cases. The set of all possible outcomes of an
experiment is the event space of that experiment [Mey70], hence Figure 6.1
illustrates the event space of fault occurrences for the ESC. The first case is
represented by the symbols "n" and "0". The second and third cases intersect,
representing situations with $b_n$ stage n box faults and $b_0$ stage 0 box faults,
$1 \le b_n < f$ and $1 \le b_0 < f$. These cases are denoted by the two directions
of diagonal lining. The last case is indicated by the vertical shading. Loss of
fault-free interconnection capability is associated with those physical fault
groupings that fall within the region of the diagram enclosed by the bold line.
The area shown for a case is not intended to be proportional to the
number of fault groupings fitting the description of that case, but rather to
Figure 6.1 (legend).
Case 1: All faults in stage n boxes, or all faults in stage 0 boxes.
Case 2: At least one but not all faults in stage n boxes.
Case 3: At least one but not all faults in stage 0 boxes.
Case 4: All faults in network components other than stage n or stage 0 boxes.
The last term covers all other ways fault-free interconnection capability can be
lost.
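
One way to write the four-term decomposition that Equation 6.1 expresses is sketched here. It is inferred only from the four cases described above and from the terms evaluated in Theorems 6.1 through 6.3, so the exact form, in particular the subtracted overlap term, is an assumption rather than a quotation:

```latex
\[
P(\text{loss} \mid N, f) =
    P(1 \le b_n < f \mid N, f)
  + P(1 \le b_0 < f \mid N, f)
  - P(b_n \ge 1 \text{ and } b_0 \ge 1 \mid N, f)
  + P(b_n + b_0 = 0 \text{ and loss} \mid N, f)
\]
```

Here $b_n$ and $b_0$ are the numbers of stage n and stage 0 box faults. Under this reading the third term removes the overlap of the first two: since $b_n + b_0 \le f$, the conditions $b_n \ge 1$ and $b_0 \ge 1$ together imply $b_n < f$ and $b_0 < f$.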

It is useful to establish some basic relationships before proceeding with the
evaluation of P(loss | N, f).

\[ P(\text{box fault} \mid \text{a fault}) = 1 - P(\text{link fault} \mid \text{a fault}), \]

since a fault must be in either a box or a link. Note that strictly

\[ P(\text{box fault} \mid N, \text{a fault}) \ne P(\text{box fault} \mid \text{a fault}), \]

because

\[ P(\text{box fault} \mid N, \text{a fault}) = \frac{\text{number of boxes}}{\text{number of boxes and links}} * P(\text{box fault} \mid \text{a fault}) = \frac{n+1}{3n+1} * P(\text{box fault} \mid \text{a fault}), \]

a function of N ($n = \log_2 N$). However, the ratio of the number of boxes to the
number of boxes and links, $\frac{n+1}{3n+1}$, rapidly approaches the value $\frac{1}{3}$ for


increasing N. Because of this behavior in the limit, and the assumption that all
boxes have the same reliability regardless of other circumstances, the
approximation

\[ P(\text{box fault} \mid N, \text{a fault}) \approx P(\text{box fault} \mid \text{a fault}) \]

is used to simplify the analysis. Similarly, the approximation

\[ P(\text{link fault} \mid N, \text{a fault}) \approx P(\text{link fault} \mid \text{a fault}) \]

is used. Note that if stage 0 were assumed to have links (which would connect
the network to its output ports), then the two above approximations become
exact relationships. For notational compactness P(box fault | a fault) and
P(link fault | a fault) will hereinafter be written P(BF|F) and P(LF|F),
respectively.
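
The limiting behavior of the box fraction can be checked numerically. The following is a minimal sketch, assuming N is a power of two and using the component counts stated in this section (N(n+1)/2 boxes and Nn links); the function name `box_fraction` is illustrative only.

```python
from math import log2

def box_fraction(N: int) -> float:
    """Ratio of boxes to boxes-plus-links in an ESC with N inputs (N = 2**n)."""
    n = int(log2(N))
    boxes = N * (n + 1) // 2   # n + 1 stages of N/2 boxes each
    links = N * n              # stages n through 1 carry N links each
    return boxes / (boxes + links)   # equals (n + 1) / (3n + 1)

if __name__ == "__main__":
    for N in (4, 16, 64, 1024, 2**20):
        print(N, round(box_fraction(N), 4))
```

The printed values decrease toward 1/3 as N grows, which is the behavior the approximation relies on.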

Let A be an event and $\bar{A}$ its complementary event. That is, $\bar{A}$ denotes the
set of all possible outcomes other than those in A. Assume an experiment is
repeated k times with P(A) constant for all repetitions. Define x to be the
number of times A occurs. Then x is the binomial random variable with
parameters k and P(A), and

\[ P(x = i) = \binom{k}{i} * P^{i}(A) * P^{k-i}(\bar{A}) = \binom{k}{i} * P^{i}(A) * (1 - P(A))^{k-i}, \]

for i = 0, 1, ..., k [Mey70]. Recall $\binom{a}{b} = \frac{a!}{(a-b)!\,b!}$ for $a \ge b \ge 0$, where
0! = 1. By definition $\binom{a}{b} = 0$ if a < b or a < 0 or b < 0. The number of

boxes in the network is N(n+1)/2, so $0 \le b \le N(n+1)/2$ (where b is the
number of box faults); the number of boxes and links is N(3n+1)/2, so
$0 \le f \le N(3n+1)/2$. Combining the relationship for the binomial random
variable with the approximation P(box fault | N, a fault) = P(BF|F) yields
the following:

\[ P(b = j \mid N, f) \approx P(b = j \mid f) = \binom{f}{j} * P^{j}(BF|F) * (1 - P(BF|F))^{f-j} = \binom{f}{j} * P^{j}(BF|F) * P^{f-j}(LF|F) \qquad (6.2) \]

for $0 \le j \le N(n+1)/2$ and $0 \le f \le N(3n+1)/2$.
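
Equation 6.2 is an ordinary binomial probability, so it can be evaluated directly. This is a minimal sketch; the function name `p_box_faults` and the argument name `p_bf` (standing for P(BF|F)) are illustrative, and P(LF|F) is taken to be 1 - P(BF|F) as stated earlier in this section.

```python
from math import comb

def p_box_faults(j: int, f: int, p_bf: float) -> float:
    """Equation 6.2: probability that exactly j of f faults are box faults."""
    p_lf = 1.0 - p_bf                      # a fault is in either a box or a link
    return comb(f, j) * p_bf**j * p_lf**(f - j)

if __name__ == "__main__":
    # With two faults and P(BF|F) = 1/3: chances of 0, 1, or 2 box faults.
    print([round(p_box_faults(j, 2, 1 / 3), 4) for j in range(3)])
```

The three printed probabilities correspond to LL-, LB-, and BB-fault pairs, the pair types used later for f = 2.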


6.3.1  Terms  with  Input  or  Output  Stage  Box  Faults 

Terms in the expression for P(loss | N, f) can now be considered. The first
term is considered first.


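Theorem 6.1:

\[
P(1 \le b_n < f \mid N, f) = \sum_{j=1}^{f-1} P(b_n = j \mid N, f)
= \begin{cases}
0, & 0 \le f \le 1 \\[4pt]
\displaystyle \sum_{j=1}^{f} \sum_{b_n=1}^{\min(j,\,f-1)} \frac{\binom{N/2}{b_n}\binom{Nn/2}{j-b_n}}{\binom{N(n+1)/2}{j}} * \binom{f}{j} * P^{j}(BF|F) * P^{f-j}(LF|F), & 1 < f \le N/2 \\[4pt]
\displaystyle 1 - \sum_{j=0}^{f} \frac{\binom{Nn/2}{j}}{\binom{N(n+1)/2}{j}} * \binom{f}{j} * P^{j}(BF|F) * P^{f-j}(LF|F), & N/2 < f \le Nn/2
\end{cases}
\]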


Proof: The case of $0 \le f \le 1$ is given for completeness.
Consider $1 < f \le N/2$. Clearly,

\[ P(1 \le b_n < f \mid N, f) = \sum_{j=1}^{f} P(1 \le b_n < f \mid N, b=j) * P(b=j \mid N, f) \]

for $1 < f \le N/2$. From Equation 6.2,

\[ P(b=j \mid N, f) \approx \binom{f}{j} * P^{j}(BF|F) * P^{f-j}(LF|F), \]

and for a fixed j

\[ P(1 \le b_n < f \mid N, b=j) = \sum_{b_n=1}^{\min(j,\,f-1)} P(b_n \mid N, b=j). \]

For a given $b_n$ there are $\binom{N/2}{b_n}\binom{Nn/2}{j-b_n}$ possible fault configurations. The factor
$\binom{N/2}{b_n}$ represents the number of ways $b_n$ faults can be located in stage n boxes,
while $\binom{Nn/2}{j-b_n}$ enumerates the ways $j - b_n$ box faults can fall outside of stage n.
Note that $\binom{N(n+1)/2}{j}$ denotes the number of ways j box faults can occur in
the ESC network without constraint. So,

\[ P(b_n \mid N, b=j) = \frac{\binom{N/2}{b_n}\binom{Nn/2}{j-b_n}}{\binom{N(n+1)/2}{j}}. \]

Thus,

\[ P(1 \le b_n < f \mid N, f) = \sum_{j=1}^{f} \sum_{b_n=1}^{\min(j,\,f-1)} \frac{\binom{N/2}{b_n}\binom{Nn/2}{j-b_n}}{\binom{N(n+1)/2}{j}} * \binom{f}{j} * P^{j}(BF|F) * P^{f-j}(LF|F). \]

For $N/2 < f \le Nn/2$ it is always the case that $b_n < f$. Thus, any pattern
of faults with at least one fault in stage n contributes to this case. It is easier
to count those fault distributions which do not contribute, i.e., those with
$b_n = 0$. Formally,

\[ P(1 \le b_n < f \mid N, f) = 1 - \sum_{j=0}^{f} P(b_n = 0 \mid N, b=j) * P(b=j \mid N, f) \]

for $N/2 < f \le Nn/2$. There are $\binom{Nn/2}{j}$ ways j box faults can all occur outside
of stage n. So,

\[ P(b_n = 0 \mid N, b=j) = \frac{\binom{Nn/2}{j}}{\binom{N(n+1)/2}{j}}. \]

Thus,

\[ P(1 \le b_n < f \mid N, f) = 1 - \sum_{j=0}^{f} \frac{\binom{Nn/2}{j}}{\binom{N(n+1)/2}{j}} * \binom{f}{j} * P^{j}(BF|F) * P^{f-j}(LF|F) \]

for $N/2 < f \le Nn/2$.


□ 
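
The two non-trivial cases of Theorem 6.1 can be cross-checked numerically. The sketch below sums the probability of the event $1 \le b_n < f$ directly and, for f > N/2, compares it with the complement form. The function names are illustrative, N is assumed to be a power of two, and f is assumed not to exceed the number of boxes so that every binomial denominator is nonzero.

```python
from math import comb, log2

def p_b(j, f, p_bf):
    """Equation 6.2: probability that j of the f faults are box faults."""
    return comb(f, j) * p_bf**j * (1.0 - p_bf)**(f - j)

def p_bn_given_b(bn, j, N):
    """Probability that bn of j box faults lie in stage n (counting argument of the proof)."""
    n = int(log2(N))
    return comb(N // 2, bn) * comb(N * n // 2, j - bn) / comb(N * (n + 1) // 2, j)

def p_direct(N, f, p_bf):
    """P(1 <= b_n < f | N, f), summed over every (j, b_n) in the event."""
    return sum(p_bn_given_b(bn, j, N) * p_b(j, f, p_bf)
               for j in range(1, f + 1)
               for bn in range(1, min(j, f - 1) + 1))

def p_complement(N, f, p_bf):
    """1 - P(b_n = 0 | N, f); equals p_direct only when f > N/2."""
    return 1.0 - sum(p_bn_given_b(0, j, N) * p_b(j, f, p_bf)
                     for j in range(0, f + 1))

if __name__ == "__main__":
    print(p_direct(4, 3, 0.5), p_complement(4, 3, 0.5))
```

For N = 4 and f = 3 (so f > N/2) the two functions agree, which is the observation the second half of the proof rests on.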


Corollary 6.1: $P(1 \le b_0 < f \mid N, f) = P(1 \le b_n < f \mid N, f)$.


Proof: The corollary follows from the structural symmetry of the ESC.


□ 


Theorem 6.1 and Corollary 6.1 address the first two terms of the equation
for P(loss | N, f), respectively. The next theorem considers the third term of
that equation.
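Theorem 6.2:

\[
P(b_n \ge 1 \text{ and } b_0 \ge 1 \mid N, f)
= \begin{cases}
0, & 0 \le f \le 1 \\[4pt]
\displaystyle \sum_{j=2}^{f} \sum_{b_n=1}^{j-1} \sum_{b_0=1}^{j-b_n} \frac{\binom{N/2}{b_n}\binom{N/2}{b_0}\binom{N(n-1)/2}{j-b_n-b_0}}{\binom{N(n+1)/2}{j}} * \binom{f}{j} * P^{j}(BF|F) * P^{f-j}(LF|F), & 1 < f \le N
\end{cases}
\]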


Proof: The case for $0 \le f \le 1$ is trivial. Consider now $1 < f \le N$.

\[ P(b_n \ge 1 \text{ and } b_0 \ge 1 \mid N, f) = \sum_{j=2}^{f} P(b_n \ge 1 \text{ and } b_0 \ge 1 \mid N, b=j) * P(b=j \mid N, f). \]

From Equation 6.2

\[ P(b = j \mid N, f) \approx \binom{f}{j} * P^{j}(BF|F) * P^{f-j}(LF|F). \]

For a fixed j,

\[ P(b_n \ge 1 \text{ and } b_0 \ge 1 \mid N, b=j) = \sum_{i=2}^{j} P(b_n \ge 1,\ b_0 \ge 1, \text{ and } b_n + b_0 = i \mid N, b=j). \]

The expression $\binom{N/2}{b_n}$ is the number of ways $b_n$ box faults can be located in
stage n, $\binom{N/2}{b_0}$ is the analogous expression for $b_0$, and $\binom{N(n-1)/2}{j-(b_n+b_0)}$ enumerates
the ways $j - (b_n + b_0)$ box faults can occur in stages n - 1 through 1. Now, given
the constraints $b_n \ge 1$, $b_0 \ge 1$, and $b_n + b_0 \le j$ there are

\[ \sum_{b_n=1}^{j-1} \sum_{b_0=1}^{j-b_n} \binom{N/2}{b_n}\binom{N/2}{b_0}\binom{N(n-1)/2}{j-b_n-b_0} \]

possible fault configurations. The limits on the double summation are such
that all possible $b_n$ and $b_0$ values meeting the constraints are generated. The
number of ways j box faults can be placed in the network without constraint is
$\binom{N(n+1)/2}{j}$, and each term of the sum over j is weighted by

\[ \binom{f}{j} * P^{j}(BF|F) * P^{f-j}(LF|F). \]


□ 


The fourth term of the general equation for P(loss | N, f),

\[ P(b_n + b_0 = 0 \text{ and loss} \mid N, f), \]

is more difficult to evaluate than the first three. In the next section it is
analyzed for f = 2 and the results combined with the work of this section,
evaluated for f = 2, to yield P(loss | N, 2).
6.3.2 Term without Input or Output Stage Box Faults

Not all fault groupings satisfying the constraint $b_n + b_0 = 0$ result in loss
of fault-free interconnection capability. This is a direct reflection of the
robustness of the ESC in the face of multiple faults, and so is a positive
attribute. Determining the fourth term of the equation for P(loss | N, 2) is,
consequently, involved.

A lossy pair is a fault pair, or pair of faults, that causes a loss of fault-free
interconnection capability. A BB-fault pair (box-box fault pair) is a fault pair
consisting of two faulty interchange boxes. An LB-fault pair and an LL-fault
pair involve a faulty link and box, and two faulty links, respectively. The
terms "LB-fault pair" and "BL-fault pair" are interchangeable. Lossy pairs can
be analogously classified as BB-lossy pairs, LB-lossy pairs, and LL-lossy pairs.
The number of lossy pairs in an ESC network is a function of N. Figure 6.2
shows an ESC network for N = 8 with a faulty box and indicates by broken
lines all network components that can form a lossy pair with the faulty box.

To determine the numbers of the types of lossy pairs, consider the
sufficient conditions for a fault pair to be a lossy pair, or lossy. Let the two
faults be in stages i and j, $1 \le i, j \le n$, and assume without loss of generality
that $j \le i$. If $(i, a_{n-1} \ldots a_1 a_0)$ and $(j, b_{n-1} \ldots b_1 b_0)$ are fault labels such that
$a_{n-1} \ldots a_{i+1} a_i = b_{n-1} \ldots b_{i+1} b_i$ and $a_{j-1} \ldots a_1 a_0 = b_{j-1} \ldots b_1 b_0$, then the fault labels
denote a lossy pair since there will exist source/destination pairs with no fault-
free connecting path (see Corollary 4.1).

Now consider the number of stage j faults which form a lossy pair with a
given stage i fault. One case is when i = j. The constraints for loss of fault-
free interconnection capability are then $a_{n-1} \ldots a_{i+1} a_i = b_{n-1} \ldots b_{i+1} b_i$ and


Figure 6.2: ESC network with N = 8, a given faulty box (indicated by the
bold lines), and all network components in stages 2 and 1 with
which it can form a lossy pair (indicated by the broken lines).




$a_{i-1} \ldots a_1 a_0 = b_{i-1} \ldots b_1 \bar{b}_0$. Only if the fault labels are complementary in bit
position 0 and match in all other bit positions is such a fault pair (i = j) lossy.
Thus there is exactly one box and one link in stage i, $1 \le i < n$, that can
form a lossy pair with another given stage i fault. For i = n, there is one link
that can form a lossy pair with another stage n link. This implies there are
N/4 BB-lossy pairs in any given stage i, $1 \le i < n$, because any given box in
stages n - 1 through 1 can form a lossy pair with one other of the N/2 boxes in
its stage. Also, there are N LB-lossy pairs in stage i, $1 \le i < n$, for this case
since each of the N links in a stage can form a lossy pair with one box in the
same stage. Finally, there are N/2 LL-lossy pairs in stage i (including i = n)
because any given link can form a lossy pair with one other of the N links in
the same stage.

A second case is when $i \ne j$, and without loss of generality it is assumed
that i > j. The constraints for lossy pairs are $a_{n-1} \ldots a_{i+1} a_i = b_{n-1} \ldots b_{i+1} b_i$ and
$a_{j-1} \ldots a_1 a_0 = b_{j-1} \ldots b_1 b_0$. In other words, bit positions i - 1 through j are
unconstrained, or need not match. These i - j unconstrained bit positions allow
$2^{i-j}$ boxes in stage j to form BB-lossy pairs with any given box in stage i,
1 < i < n. Each box in stage i, 1 < i < n, can also form an LB-lossy pair with
$2 * 2^{i-j}$ stage j links. The factor of two accounts for the two stage j links which
can form a lossy pair with a stage i box for every stage j box that can form a
lossy pair with that stage i box. There are $2^{i-j-1}$ stage j boxes that can form
LB-lossy pairs with any given stage i link, $1 < i \le n$. The minus one in the
exponent is due to the fact that there are half as many stage j boxes which can
form a lossy pair with a stage i link as can form a lossy pair with the stage i
box associated with that link. (A stage i box has two associated stage i links.)
Finally, $2^{i-j}$ stage j links can form LL-lossy pairs with any given stage i link,



$1 < i \le n$, due to the i - j unconstrained bit positions.

Since f = 2 was assumed, there is only one fault pair in the network. Let
the number of items of a type T be $\eta(T)$. For BB-lossy pairs

\[ \eta(\text{BB-lossy pairs} \mid b_n + b_0 = 0, N) = \frac{N}{4}(n-1) + \sum_{i=2}^{n-1} \sum_{j=1}^{i-1} \frac{N}{2} * 2^{i-j}. \qquad (6.3) \]

The first term on the right-hand side of this equation is the number of BB-lossy
pairs in which both faults are in the same stage i, $1 \le i < n$. Each term
resulting from the double summation is the number of BB-lossy pairs for a
given i and j, $i \ne j$. All possible pairs of i and j values, given the assumption
$1 \le j < i < n$, are included. This equation simplifies to

\[ \eta(\text{BB-lossy pairs} \mid b_n + b_0 = 0, N) = \frac{1}{4}\left[2N^2 - 3Nn - N\right]. \]

For LB-lossy pairs

\[ \eta(\text{LB-lossy pairs} \mid b_n + b_0 = 0, N) = N(n-1) + \sum_{i=2}^{n-1} \sum_{j=1}^{i-1} N * 2^{i-j} + \sum_{i=2}^{n} \sum_{j=1}^{i-1} N * 2^{i-j-1}. \qquad (6.4) \]
LB-lossy pairs with a faulty stage i link. Equation 6.4 can be simplified,
yielding

\[ \eta(\text{LB-lossy pairs} \mid b_n + b_0 = 0, N) = 2N^2 - 2Nn - 2N. \]

The number of LL-lossy pairs,

\[ \eta(\text{LL-lossy pairs} \mid b_n + b_0 = 0, N) = \frac{N}{2} n + \sum_{i=2}^{n} \sum_{j=1}^{i-1} N * 2^{i-j}. \qquad (6.5) \]

The first term represents those lossy pairs of this type with both faults in the
same stage (i.e., i = j, $1 \le i \le n$). Each term of the double summation
denotes the number of LL-lossy pairs for a given i and j, $i \ne j$. The expression
simplifies as follows.

\[ \eta(\text{LL-lossy pairs} \mid b_n + b_0 = 0, N) = 2N^2 - \frac{3}{2}Nn - 2N. \]
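
The simplifications of Equations 6.3 through 6.5 can be checked by evaluating the sums term by term. The sketch below is not the enumeration program of Appendix A; it only adds up the per-stage counts stated in the text and compares them with the closed forms. The split of the LB sum into a "faulty stage i box" part and a "faulty stage i link" part follows the counts given above and is an assumption about how Equation 6.4 is arranged.

```python
from math import log2

def lossy_pair_sums(N):
    """Per-stage counts of lossy pairs with no stage n or 0 box fault (N = 2**n, N >= 4)."""
    n = int(log2(N))
    bb = (N // 4) * (n - 1) + sum((N // 2) * 2**(i - j)
                                  for i in range(2, n) for j in range(1, i))
    lb = (N * (n - 1)
          + sum(N * 2**(i - j) for i in range(2, n) for j in range(1, i))              # faulty stage i box
          + sum(N * 2**(i - j - 1) for i in range(2, n + 1) for j in range(1, i)))     # faulty stage i link
    ll = (N // 2) * n + sum(N * 2**(i - j)
                            for i in range(2, n + 1) for j in range(1, i))
    return bb, lb, ll

def lossy_pair_closed_forms(N):
    """Closed forms stated after Equations 6.3, 6.4, and 6.5."""
    n = int(log2(N))
    return ((2 * N * N - 3 * N * n - N) // 4,
            2 * N * N - 2 * N * n - 2 * N,
            (4 * N * N - 3 * N * n - 4 * N) // 2)

if __name__ == "__main__":
    for N in (4, 8, 16, 32, 64):
        print(N, lossy_pair_sums(N), lossy_pair_closed_forms(N))
```

For each N the two tuples printed on a line should be identical.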

At this point the number of BB-lossy pairs, LB-lossy pairs, and LL-lossy
pairs, all not involving any stage n or 0 box, is known. Let $\alpha$ represent
"$b_n + b_0 = 0$" for conciseness in the statement of the following lemma.


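Lemma 6.1:

\[ 1.\quad P(\text{loss} \mid \alpha, N, \text{BB-fault pair } (f=2)) = \frac{\eta(\text{BB-lossy pairs} \mid \alpha, N)}{\eta(\text{BB-fault pairs} \mid \alpha, N)} = \frac{\frac{1}{4}\left[2N^2 - 3Nn - N\right]}{\binom{N(n-1)/2}{2}} \]

\[ 2.\quad P(\text{loss} \mid \alpha, N, \text{LB-fault pair } (f=2)) = \frac{\eta(\text{LB-lossy pairs} \mid \alpha, N)}{\eta(\text{LB-fault pairs} \mid \alpha, N)} = \frac{2N^2 - 2Nn - 2N}{(Nn)(N(n-1)/2)} \]

\[ 3.\quad P(\text{loss} \mid \alpha, N, \text{LL-fault pair } (f=2)) = \frac{\eta(\text{LL-lossy pairs} \mid \alpha, N)}{\eta(\text{LL-fault pairs} \mid \alpha, N)} = \frac{2N^2 - \frac{3}{2}Nn - 2N}{\binom{Nn}{2}} \]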
Proof: The number of BB-fault pairs with no fault in stages n or 0 is
$\binom{N(n-1)/2}{2}$, i.e., the number of ways to select two boxes from the N(n-1)/2 not
in stages n or 0. Equation 6.3 gives the number of BB-lossy pairs with
this constraint. The number of LB-fault pairs not containing a stage n or 0 fault
is (Nn)(N(n-1)/2). There are Nn links, any one of which can be chosen
independently of any one of N(n-1)/2 boxes. Equation 6.4 gives the
corresponding number of LB-lossy pairs. Finally, LL-fault pairs number
$\binom{Nn}{2}$, and Equation 6.5 enumerates LL-lossy pairs.


Lemma 6.2: Given two faults in an ESC network,

\[ P(b_n + b_0 = 0 \mid N, 2, \text{BB-fault pair}) = \frac{\binom{N(n-1)/2}{2}}{\binom{N(n+1)/2}{2}} \]

\[ P(b_n + b_0 = 0 \mid N, 2, \text{LB-fault pair}) = \frac{(Nn)(N(n-1)/2)}{(Nn)(N(n+1)/2)} \]

\[ P(b_n + b_0 = 0 \mid N, 2, \text{LL-fault pair}) = 1 \]

Proof: The right sides of the above equations enumerate the ways a given
pair type can occur and satisfy the constraint $b_n + b_0 = 0$, divided by the
total number of ways it can occur. For LB-fault pairs only the faulty box is
relevant to determining the value of $b_n + b_0$.

□

The focus of the section is to obtain an expression for
$P(b_n + b_0 = 0 \text{ and loss} \mid N, 2)$ so that P(loss | N, 2) can be evaluated.

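Theorem 6.3: Given two faults in an ESC network,

\[
\begin{aligned}
P(b_n + b_0 = 0 \text{ and loss} \mid N, 2) &= \sum_{\text{fault pair types}} P(b_n + b_0 = 0, \text{ loss, fault pair type} \mid N, 2) \\
&= \frac{\frac{1}{4}\left[2N^2 - 3Nn - N\right]}{\binom{N(n-1)/2}{2}} * \frac{\binom{N(n-1)/2}{2}}{\binom{N(n+1)/2}{2}} * P^{2}(BF|F) \\
&\quad + \frac{2N^2 - 2Nn - 2N}{(Nn)(N(n-1)/2)} * \frac{(Nn)(N(n-1)/2)}{(Nn)(N(n+1)/2)} * 2 * P(BF|F) * P(LF|F) \\
&\quad + \frac{2N^2 - \frac{3}{2}Nn - 2N}{\binom{Nn}{2}} * 1 * P^{2}(LF|F)
\end{aligned}
\]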

Proof: Each of the three summation terms can be written

\[ P(\text{loss} \mid b_n + b_0 = 0, \text{fault pair type}, N, 2) * P(b_n + b_0 = 0 \mid \text{fault pair type}, N, 2) * P(\text{fault pair type} \mid N, 2). \]

Lemma 6.1 gives the first factor of this expression and Lemma 6.2 the second.
The third term can be computed using the relationship

\[ P(x \text{ box faults} \mid N, 2) = \binom{2}{x} * P^{x}(BF|F) * P^{2-x}(LF|F) \]

with x = 2 for BB-fault pairs, x = 1 for LB-fault pairs, and x = 0 for LL-fault pairs.
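
The three-factor structure used in the proof can be written out as a short calculation. This is a minimal sketch, assuming N is a power of two with N >= 4 and using the closed-form lossy-pair counts from Equations 6.3 through 6.5; the function and variable names are illustrative.

```python
from math import comb, log2

def p_loss_no_end_stage_box_faults(N, p_bf):
    """P(b_n + b_0 = 0 and loss | N, 2) as the sum over BB-, LB-, and LL-fault pairs."""
    n = int(log2(N))
    p_lf = 1.0 - p_bf
    mid_boxes = N * (n - 1) // 2    # boxes in stages n-1 through 1
    all_boxes = N * (n + 1) // 2    # boxes in all n+1 stages
    links = N * n
    # Each term: (Lemma 6.1 factor) * (Lemma 6.2 factor) * P(fault pair type | N, 2)
    bb = (((2 * N * N - 3 * N * n - N) / 4) / comb(mid_boxes, 2)
          * comb(mid_boxes, 2) / comb(all_boxes, 2) * p_bf**2)
    lb = ((2 * N * N - 2 * N * n - 2 * N) / (links * mid_boxes)
          * (links * mid_boxes) / (links * all_boxes) * 2 * p_bf * p_lf)
    ll = ((2 * N * N - 1.5 * N * n - 2 * N) / comb(links, 2)
          * 1.0 * p_lf**2)
    return bb + lb + ll

if __name__ == "__main__":
    for p_bf in (0.0, 0.5, 1.0):
        print(p_bf, p_loss_no_end_stage_box_faults(8, p_bf))
```

Keeping the Lemma 6.1 and Lemma 6.2 factors separate mirrors the unreduced form of the theorem, at the cost of a cancellation that a simplified version would avoid.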
6.3.3 Solution for Two Faults

Theorem 6.1, Corollary 6.1, Theorem 6.2, and Theorem 6.3 give
expressions for each of the terms of P(loss | N, 2), with parameters N, P(BF|F),
and P(LF|F).


Proof: The result follows directly from the substitution into Equation 6.1 of
terms derived from Theorems 6.1, 6.2, 6.3, and Corollary 6.1, given f = 2.

□

The coefficients of the equation for P(loss | N, 2) in Theorem 6.4 have been
verified by direct enumeration of lossy pairs for N = 4, 8, 16, 32, and 64. See
Appendix A for the computer program used.

6.4  Conclusions 

This chapter has studied the multiple-fault tolerance of the ESC.
The main objective was an expression for the probability of loss of fault-free
interconnection capability given two faults, box and/or link, in the network.
CHAPTER 7

PROPERTIES  OF  AN  ENHANCED  EXTRA  STAGE  CUBE 
INTERCONNECTION  NETWORK 

7.1  Introduction 

The fault model of the ESC presupposes that the input demultiplexers and
output multiplexers are fault-free. This is necessary in the ESC topology so
that devices using the network will actually have access to the network. In this
chapter a way to remove that constraint is given.

A further advantage of studying the multiple-fault tolerance of the ESC,
as was done in the previous chapter, is that it may shed light on ways to
improve the ability of the ESC to retain fault-free interconnection capability
in the face of multiple faults. The analysis is carried forward to show that
input and output stage boxes are a significant source of all lossy pairs. A
method of alleviating this effect is given; it can improve multiple-fault tolerance
of the ESC significantly.


7.2 Eliminating Hardcore

The hardcore of a fault-tolerant system is the set of components which are
assumed to be fault-free. For the ESC the hardcore is the input demultiplexers and
output multiplexers and their connections to the devices using the network. If
any of these fail then the device attached to the network at that port no
longer has access to the network and cannot communicate with (i.e., cannot




send to or cannot receive messages from) any other device.

If devices attached to the ESC each had two input/output ports then ESC
input demultiplexers and output multiplexers would be unnecessary. One
output of each device would be connected directly to a stage n box input, the
other directly to the corresponding stage n multiplexer. Similarly, one device
input would connect directly to a stage 0 box output, the other to the
appropriate stage 0 demultiplexer. The advantage of this approach is that a
single failure will not deny a device access to the network. Figure 7.1 shows
an ESC network configured in this manner.

7.3 Individual Box Enabling/Disabling

Theorem 6.4 enumerates all lossy pairs, given that stages n and 0 are each
bypassed in their entirety if required. Those lossy pairs that include a stage n
or 0 box fault can be separated for comparison with those that do not. (The
terms of Theorem 6.4 are not fully reduced so as to clarify this separation.)
Of the BB-lossy pairs the ratio of pairs that include a stage n and/or 0 fault to
those that do not is

\[ \frac{4Nn - 2N}{4N - 6n - 2}. \]

For LB-lossy pairs the ratio is

\[ \frac{2Nn}{4N - 4n - 4}. \]

LL-lossy pairs cannot include stage n or 0 box faults.

Table 7.1 gives the value of these ratios for various values of N and the
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11  or  0  lidx  fail 
shows that with stage bypassing the majority of BB-lossy pairs contain stage n
or 0 box faults. Further,

\[ \lim_{N \to \infty} \frac{4Nn - 2N}{4N - 6n - 2} = \infty. \]

So, as network size grows more and more BB-lossy pairs include a stage n or 0
box fault. For LB-lossy pairs

\[ \lim_{N \to \infty} \frac{2Nn}{4N - 4n - 4} = \infty. \]

The data presented in Table 7.1 should not be too surprising. Clearly, the
majority of BB- and LB-lossy pairs contain stage n or 0 box faults. Bypassing
all of stage n or 0 given a single fault is overly aggressive, especially as it
guarantees any subsequent fault will complete a lossy pair. Bypassing only
faulty boxes in stage n and 0 can improve ESC multiple-fault tolerance.

There are several ways to support box bypassing. One method is to
include address decoding logic with each box disabling circuit. A system
control unit would use an address bus to command the multiplexers and
demultiplexers of the failed box to the disable state. This scheme requires N/2
such decoding circuits and an n - 1 bit bus for both stage n and 0.

Another method is to dedicate a control line for the bypass circuitry with each
box, instead of one per stage; no new logic circuitry in the ESC is required for
this method. There are two ways these lines can be used. In one, the lines are
connected to a system control unit, and it activates the appropriate line. This
scheme is likely to be cumbersome for large networks. An alternative is to link
each line to the device using the box it controls. When a stage n or 0 box fault
occurs a special fault label is passed by a system control unit to all devices
using the network. Devices are responsible for determining the correct state for
boxes under their control. If the network is used in a unidirectional mode,
each device is responsible for both a stage n and 0 box.


Operational protocol with box bypassing is as follows. The fault-free
network configuration is the same as with stage bypassing. Given f faults,
$1 \le f$, if $b_n = f$ then all stage n boxes are bypassed and all stage 0 boxes are
enabled; if $b_0 = f$ then all stage 0 boxes are bypassed and all stage n boxes are
enabled; if neither of the preceding conditions is true then every faulty stage n
and 0 box is bypassed and the remaining stage n and 0 boxes are enabled.
Note that box faults not in stages n or 0 and link faults may be overcome due
to the two paths from each source to any destination available by enabling all
connected is bypassed then the device uses tags formed as if stage n were
bypassed, i.e., the fault location is a stage n box.
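
The bypass decision in the operational protocol can be sketched as a small function. The names `bypass_plan`, `faulty_stage_n`, `faulty_stage_0`, and `other_faults` are illustrative and not taken from the text. The fault-free branch assumes the stage-bypassing default has stage n bypassed and stage 0 enabled; the text says only that the fault-free configuration matches stage bypassing, so that detail is an assumption.

```python
def bypass_plan(faulty_stage_n, faulty_stage_0, other_faults, boxes_per_stage):
    """Return (bypassed stage n boxes, bypassed stage 0 boxes) as sets of box indices.

    faulty_stage_n / faulty_stage_0: sets of faulty box indices in those stages.
    other_faults: number of faults elsewhere (links, or boxes in stages n-1..1).
    """
    b_n, b_0 = len(faulty_stage_n), len(faulty_stage_0)
    f = b_n + b_0 + other_faults
    whole_stage = set(range(boxes_per_stage))
    if f == 0:
        # Assumed fault-free configuration: stage n bypassed, stage 0 enabled.
        return whole_stage, set()
    if b_n == f:
        return whole_stage, set()        # all faults in stage n: bypass stage n
    if b_0 == f:
        return set(), whole_stage        # all faults in stage 0: bypass stage 0
    # Otherwise bypass only the faulty end-stage boxes and enable the rest.
    return set(faulty_stage_n), set(faulty_stage_0)

if __name__ == "__main__":
    print(bypass_plan({1}, set(), 0, 4))   # single stage n box fault
    print(bypass_plan({1}, {2}, 1, 4))     # mixed faults
    print(bypass_plan(set(), set(), 2, 4)) # faults only in interior components
```

In the last case both end stages are fully enabled, which gives the two paths per source/destination pair that the protocol relies on to route around interior faults.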


7.4  Reliability  of  the  Enhanced  Extra  Stage  Cube  Network 

The value of P(loss | N, 2) can now be evaluated directly in terms of fault
pairs. Removing the restriction to only stages n - 1 through 1 adds $\frac{3N^2}{2} - 2N$
new BB-lossy pairs to the total given by Equation 6.3, $2N^2 - 2N$ new LB-lossy
pairs beyond those counted in Equation 6.4, and no new LL-lossy pairs. The
new BB-lossy pairs include $N(2^{n+1} - 2)/2$ involving a stage n box fault, and
$N(2^{n} - 2)/2$ involving a stage 0 box fault and no stage n box fault. The new
LB-lossy pairs include $N(2^{n+1} - 2)/2$ involving a stage n box fault, and an equal
number with a stage 0 box fault. The method used to count the additional
lossy pairs is analogous to the methodology used in Chapter 6.
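
The component counts quoted above can be checked against the stated totals with a few lines of arithmetic. This is a minimal sketch assuming N = 2^n; the function name is illustrative.

```python
from math import log2

def new_lossy_pairs(N):
    """Additional (BB, LB) lossy pairs once stage n and 0 boxes are included."""
    n = int(log2(N))
    bb_with_stage_n = N * (2**(n + 1) - 2) // 2
    bb_stage_0_only = N * (2**n - 2) // 2        # stage 0 box fault, no stage n box fault
    lb_with_stage_n = N * (2**(n + 1) - 2) // 2
    lb_with_stage_0 = lb_with_stage_n            # "an equal number"
    return bb_with_stage_n + bb_stage_0_only, lb_with_stage_n + lb_with_stage_0

if __name__ == "__main__":
    for N in (4, 8, 16, 32, 64):
        bb, lb = new_lossy_pairs(N)
        print(N, bb, 3 * N * N // 2 - 2 * N, lb, 2 * N * N - 2 * N)
```

Each printed line should show the summed components equal to the corresponding total.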


Theorem 7.1: Given two faults in an ESC network using box bypassing
because lossy pairs are now defined for all stages of the network. Adding the
newly resulting lossy pairs to those enumerated in their respective equations
(Equations 6.3, 6.4, and 6.5), and using the resulting values in equations
analogous to those of Lemma 6.1, but including all network stages, yields the
terms of the above summation.


Table 7.2 gives the ratios of probability of loss of fault-free
interconnection capability using stage bypassing to using box bypassing, for
BB- and LB-lossy pair types. For example, Table 7.2 shows that for N = 1024 an
I'^'C  network  with  t>o\  bypassing  n  2  t>.$  time  less  likely  to  lose  fault-free 
int- r- ■.ipabiiit'.  •  e.  t  H.-'f.oilt  pair  than  e  an  fsSf'  with  stage 
!■'.  p.i.  .e'j  [  f  I  lefiuit  pair-  tio-  I  .',i,  I-  i>  (12  t  .If  e,:iirse,  .‘d,ii!  i,  .na!  IsSC 
ir  iS."  of  an)/  ii-i-i  i  ,  i:i.|  /•  ■a- a  t  '  e.;ri  fo!  in  .an  a'Onal 

U'  ■'  dll'  '  deiTe.-tsinr  r  e!  la  I  •  ■ !  il 
Table 7.2: Ratio of probability of loss of fault-free interconnection capability
using stage bypassing versus box bypassing, for BB- and LB-lossy
pair types. The probability of loss of fault-free interconnection
capability due to LL-lossy pairs does not depend on bypassing
method.

Columns: BB-lossy pair ratio (stage bypassing to box bypassing);
LB-lossy pair ratio (stage bypassing to box bypassing).


l'>  (friiu  'rh-Mirfin  -}).  riii'l  P;  1- is.s  1  N  -  lOL’ 1.  L’ )  -  0 ‘J6(3  using  box 


i\)';is'ong  '  ilii'  irup'.  I  rii.'i'T  b>  -i  of  3.').  Tyvn  sjii'i  lal  cases  are 

i  ’( 1  I  I  )  ”  1  ;uii!  r(lU  j  F  )  =  1  If  1  ■( l.I  i  1  )  -  1  1  e  ,  I'( l>F  j  F  j  -  0,  then  there 
;i  fsui!  I  1  ::',re  <■  inijirMviMnMi:  titrur  1  ypassine  j.olicy  w.i'.l  cnalili  all  stage 

n  ainl  (1  in  the  .'vcr.t  of  a  fault  or  faults  If  I’(BF|  F)  -  I  crilv  HfFfault 

[  irs  AT’  (-'o-il-le  ari'i  ihe  firs'  Oil.iiun  of  Fable  7.‘J  ahovv^  the  factor  by  v.hich 

I’ll' 'as'  N.  'J  )  '\(uKl  <1  ■  F'  ,'LSe. 

i'h'  til  lent '  "f  lie-  filiation  foi  i’fiossj  N. ‘2)  vMth  b.o  byjoissiug  have 
i'"fii  N  erilif'l  b}  ilire  -t  fiuirner:itioii  for  N  =-4,S,  lt>  32.  and  61.  Sec  Appendix  f.i 
'or  'hf  eoiiipiii'-r  pf'  gram  ii'-ed  Ftif  d.'i.'-hfl  lines  in  Figure  7.2  show 
l’iho-,\  2)  With  I'ox  [lie  -lilt’  f.'i  1  <  N  <  I ‘ii2  and  a  range  of  F(BFjF) 
\  '.i'le'-  \'t‘  that  'lie  diLsIifd  litf  f'l'  F(t;i  I  Fl  -- 0  i'  eoincideiit  with  the  solid 
are equally likely. The fact that large networks gain more than smaller ones
underscores the value of this modification for use in large-scale parallel
processing systems: larger networks are more likely to experience a fault at any


CHAPTER  8 


ANALYSIS  OF  ALTERNATIVE  SWITCHING  ELEMENTS 

8.1  Introduction 

After capability has been established, a next step in the engineering
of a new interconnection network is a consideration of its implementation. The
advent of Very Large Scale Integration (VLSI) technology motivates the use of
complex building blocks in constructing a system, especially when the number
of distinct parts can be kept small. In recognition of the growing applicability
and capability of VLSI technology, many papers have included discussions on
using switching elements larger than 2x2 interchange boxes for implementing
multistage cube-type networks [AdS81b, ChM82, CiS81, CiS80, HuM81, Law75,
MAS81, Pat81, Pea77, PKM80, Smi81]. The ESC was defined in terms of
interchange boxes, bypass circuitry, and the connections between them; thus,
the ESC can be constructed from just a few different parts. Interchange box
circuitry requirements do not tax the current state of the VLSI art (pinout
requirements are somewhat more challenging).

An interchange box is a 2-input/2-output crossbar switch. A more general
form of interchange box is an a-input/a-output (a x a) crossbar. A network
closely related to the ESC can be constructed from a x a switching elements,
a = 2^i for i = 2, 3, ..., using cube interconnections between stages. This
chapter examines the relative merits of 2x2 and 4x4 crossbar switching
elements, or nodes, with regard to performance. A 4x4 crossbar node is a
logical successor to the interchange box, yet it is not so complex as to preclude
implementation on a single VLSI chip [MAS81].

Thus, this chapter addresses one high level issue confronting a network
designer: choice of switching element architecture to satisfy a performance goal.
Other design goals such as cost, power consumption, testability, and reliability
are also aspects of any development project and will mandate switching
element structure. Other high level issues such as choice of integrated circuit
technology and packaging constraints will influence the choice of structure.
These issues would be decided in conjunction with or subsequent to the
investigation carried out here. Beyond these high level issues are low level
issues such as details of circuit design, circuit layout, and design of test
sequences. Concurrent and subsequent activity in such areas as design
verification and simulation is also necessary to bring a concept into physical
realization.

Following a review of existing literature on a x a, a > 2, switching
elements for use in multistage cube-type networks, the switching element model
used for the performance analysis here is given. With this background the
performance of 4x4 crossbar nodes and a 4-input/4-output node constructed
from interchange boxes is compared. A method of constructing ESC networks
from these switching elements and their effects on network performance is
then presented.
8.2  Previous  Work

Others have considered implementation of multistage cube-type networks
using substitutes for interchange boxes (2x2 switches). In [Pea77], Pease notes
that there is considerable flexibility when making implementation decisions
concerning the indirect binary n-cube network. In [Law75], Lawrie mentions
several design options for the Omega network. These networks are
topologically equivalent to the Generalized Cube [SiS78], as mentioned in
Chapter 3, Section 3.2. One possibility is to substitute 4x4 omega networks
for a group of four 2x2 exchange-broadcast units. The functionality of an
omega network built from these elements would be identical to one constructed
from 2x2 elements. Another possibility mentioned is to construct the 4x4
elements with crossbar and broadcasting capabilities. In this case, the resulting
network is more powerful than the basic omega network constructed from 2x2
elements.




The possibility of using crossbars of size a x a, where a = 2, 4, or 8, to
implement digit controlled, or delta, networks is discussed by Patel in [Pat81].
Due to technological limitations, Patel concludes that crossbars of size 8x8 are
probably not practical at the present time; however, 4x4 crossbars are
considered feasible. A performance analysis is presented for delta networks
with 2x2 crossbar elements. The probability of acceptance is the likelihood
that a processor request for connection through the network will be granted.
Bandwidth is defined as the expected number of requests accepted per network
cycle. The probability of acceptance of a request and the expected bandwidth
of any 2^n x 2^n delta network, given some mean request generation rate, are
given in the paper. The results are extended to a^n x a^n delta networks.




The performance of buffered delta networks in a packet switched
environment is considered by Dias and Jump in [DiJ81]. The omega network, a
member of the class of delta networks, is analyzed in terms of throughput and
turn-around-time. Throughput is defined as the average number of packets
delivered in unit time for a particular network and environment.
Turn-around-time is the average time interval between the time a packet is placed in
a buffer at a network input and the time at which it is placed in a buffer at a
network output. The analysis and simulations show that in terms of
throughput, buffered delta networks compare favorably with unbuffered
crossbar networks in a packet switched environment. The study is, however,
carried out only for delta networks using 2x2 crossbar switching elements. As
with the special case of omega networks, delta networks using 4x4 crossbar
switching elements will be more powerful than those with 2x2 elements.

Ciminiera and Serra have investigated the probability of acceptance, given
an asynchronous control scheme, of a class of cube-type networks using circuit
switching [CiS81]. Again, acceptance is defined as the immediate granting of a
processor request for a connection through the network. The assumptions used
are: processors generate random and independent requests uniformly
distributed among all the memory modules of the system; requests are granted
or rejected in negligible time; and only one request may be generated by a
;  r  r  0  i  line  \\  ;tii  iln-  :  i  'I'd  itn-  i':  ■  .b.it'ilitv  of  aeci-i't  ance  is 

•\  ilei;"!  'll;  1  -e  of  ;i!i  .^v\!  hro:,  t,  loniio]  scheme  IS  .1  mode)  wliich  finds 

[i art )' iiiar  ,ap;'li -.ibiiit  \  in  MIMD  'o  i  i-oc 

'l  l"  c||  .s  .f  '-W -li.aii)  ati  ;)‘!u..rk  .  is  v  ^r;.  cein  r.il  KloI.Td, 


'"•'■i'  hiiie  orii  (lire-,  of  a/o  ...  b  ,ir.  pois.sibb-  wlicre  a  and  b  ar<  positive 


comp.arfd  b.uivan  and  crossbar  networks  |f  ra-Sli 


I  rank  111!  h  as 


i!"  !  U'^Mie  m  tncx'^urf  f'htainod  from  a  model 

iii'  i  ri  ■  '  r  ii;>i  i'io  !'  ;il  \  i  '-1  inifd'  in''iit  atioii  area  req\iiremeiits  and  network 

■ie]  i\  (  r  i:  n'  t>>.  .rk-  w.-r"  found  lo  iiave  >imilar  [lerformanee 

li:ir;.o  r'  '  oiM'.an  miu.^rk''  witli  re-peel  i..  ! ii'.s  i  riterion.  Tins  meri-aiire 
I'  'I  -led  1  !'  ;  :iO  1.  mI \ri\  emted  f.  ,r  tpiaritifyine:  network  porformanee  pivon 

;  T  ,l  ■  e  :i,;  '  I  ;  f  \  !  '- I  1 ;  1 1  [  ■  |e  m  e  ft  I  a  t  io(l 

8.3  Switching  Element  Model

i  tie  j;er  il,.  ' d  (  like  operation  model  assumed  for  the  analysis  in  this 
'i.Mio,-  n,  '.,,1,,,,  T,  ,//:riy  to.g.^  !Law7.'),  SiM-^ll'l  and  ]'ackct-/>n'itrhed  message 
h  in  iii'ii:::  k!  I  tiai  o.  a  packet  consisting  of  a  routing  tag  and  a  number 

of data items makes its way from stage to stage, releasing lines and switching
nodes immediately after using them. The size of each input queue in a
switching node is assumed to be an integral multiple of the packet size. Thus,
packet size is not restricted to any particular number of words. Packet
switching in multistage networks constructed from interchange boxes has been
!i.  m d  i:  I  hi'.),  Di.l'-o  Nb-.-tsoj  I'aeket  switetiiug  ;n  SW'-banyan  networks 
:.  .-  ;■■■■  ;.  d'-  ■!■■■  d  ri  r!  TPl 

The intent of this chapter is to investigate the performance of 4x4 crossbars
versus 2x2 crossbars (interchange boxes). Since a single interchange box is
not directly comparable to a 4x4 crossbar (i.e., it can only handle two
items at a time instead of four), the 4x4 crossbar is compared to a
4-input/4-output configuration of four interchange boxes. This configuration is
called a composite node and is shown in Figure 8.1. Level 1 of a composite node
consists of the two boxes connected to the node inputs. Level 2 consists of
the two boxes connected to the outputs. A Generalized Cube
network constructed from properly connected (to be specified later) 4x4
composite nodes is identical to one constructed from interchange boxes.

An examination of the 4x4 crossbar node in Figure 8.2(a) shows that its external
connections are identical to those of the 4x4 composite node, so it can be
directly substituted for a 4x4 composite node. Figure 8.2(b) depicts the
crosspoint, or switch, between an input and an output. Henceforth, both
crossbar and composite nodes are assumed to be 4x4, unless otherwise stated.

The performance of the crossbar node and the composite node will be
compared on both a local and global level. On the local level, blocking within
a node is examined. On the global level the permuting ability of two networks
constructed from the respective switching nodes is compared. Intuitively, one
would expect that the crossbar nodes would be superior with respect to
blocking and that networks constructed from them could perform more
permutations. Indeed this is the case. The purpose of this section is to
quantify these differences in performance. In the analysis for both the crossbar
node and the composite node of the time required for messages to pass through
the nodes the following assumptions are made:

1. Initially, all nodes are empty.

2. There can be from 1 to 4 messages at the inputs of a node at any one
   time.

3. Each message has only one destination (i.e., no broadcasting).

4. The desired node output for each message is a uniformly distributed
   random variable.

5. A message requires the same time to pass through either a 2x2 or 4x4
   crossbar. Hence, the minimum time to traverse a 4x4 crossbar node is
   half that of the 4x4 composite node.

The third assumption is made to simplify the analysis. The fourth
assumption is equivalent to asserting that message destinations (network
outputs) are a uniformly distributed random variable. This is because if each
node output is selected with equal probability then there is an equal probability
of reaching any given network output. The last assumption is based on the
circuit characteristics of the two node designs discussed in [McM82]. Based on
levels of logic it is reasonable to assume that the nodes would operate at similar
speeds. However, if there is a speed difference, the performance analysis of the
following sections would remain valid after the application of the appropriate
weighting factor to the results. The last assumption simplifies the analysis.

8.4  4X4  Crossbar  Node  Performance  Analysis 

The time required for a single message to pass through a crossbar is
referred to as a time step. Let t be the maximum number of messages destined
for the same crossbar node output. Then t is the time, in units of time steps,
required for all messages to transit the node, since the t contending, or
conflicting, messages must exit the node sequentially and no other node output
has a greater number of messages to pass. Let m represent the number of
messages present at the inputs of a node. The probability that the time for all
messages to transit a node is T, given that there are M messages, is denoted
$P(t=T \mid m=M)$. The expected transit time given M messages is denoted
$E(t \mid m=M)$.


If a single message is present at the inputs of a crossbar node it will pass
through (transit) in one time step. Thus,

\[ P(t=1 \mid m=1) = 1. \]


For m = 2, transit times of one or two are possible. A transit time of two
occurs only if the two messages contend for a given node output. There are
four ways this can occur, one for each node output. Since message destinations
are independent there are $4^2$ distinct choices of destinations. Thus, a transit
time of one occurs for all $4^2$ choices except the four with contention. So,
output, there are $4 \cdot 3 \cdot 2 = 24$ possible choices. Hence,

\[ P(t=1 \mid m=3) = \frac{24}{4^3} = \frac{3}{8}. \]


If some output has two messages, then there are
$\binom{3}{2} \cdot 4 \cdot 3 = 36$ choices. The factor $\binom{3}{2}$ is the
number of message pairs which can be assigned to any of the four outputs.
Recall that $\binom{a}{b} = \frac{a!}{(a-b)!\,b!}$. The remaining message can go
to any of the three unused outputs. Thus,

\[ P(t=2 \mid m=3) = \frac{36}{4^3} = \frac{9}{16}. \]


There are four ways three messages can contend for one output, requiring three
time steps for all to transit the node. Hence,

\[ P(t=3 \mid m=3) = \frac{4}{4^3} = \frac{1}{16}. \]
For m = 4, $1 \le t \le 4$. There are 4! ways the messages can have distinct
outputs. So,

\[ P(t=1 \mid m=4) = \frac{4!}{4^4} = \frac{3}{32}. \]


There are two ways t = 2 can occur with m = 4. One is when only one output
has two messages. There are $\binom{4}{2} \cdot 4 \cdot 3 \cdot 2 = 144$ ways
this can occur. Another way is for two outputs to have two messages each.
There are $\binom{4}{2} \cdot \binom{4}{2} = 36$ ways to realize this; the
reasoning is as follows. The first factor is the number of message pairs that
can be formed by taking two messages from four. Furthermore, it is also the
number of message pairs that can be formed by using all four messages, since
choosing one pair implicitly creates a second. The second $\binom{4}{2}$ factor
is the number of ways to choose two distinct outputs for the message pairs.
Thus,

\[ P(t=2 \mid m=4) = \frac{144 + 36}{4^4} = \frac{45}{64}. \]
The preceding analysis is most applicable to the operation of switching
elements in an MIMD system. This is because the number of messages present
at any given time on switching element inputs and message destinations were
both assumed to be random, and in the case of destination addresses, uniformly
distributed. (No assumption was made about the distribution of arriving
messages.) These assumptions form a plausible model for MIMD computations,
since such processing is characterized by communication that is, typically,
weakly structured and of variable intensity.


Operation of a switching element in an SIMD environment will more
usually entail permutation connections among the devices using the network.
Consider the crossbar node in these circumstances (i.e., m = 4 and each message
is destined for a unique network output). The crossbar node is capable of
performing all 4! permutations of four items without conflict. While
network outputs are unique for each message, network topology may require
several messages to use the same output of some switching element. Thus, the
probability that switching elements will introduce delay as a result of conflict is
of interest. If node output selection is random with uniform distribution, then
for the crossbar node

\[ P(\text{delay} \mid m=4) = 1 - P(\text{no conflict} \mid m=4) = 1 - \frac{4!}{4^4} = 0.906. \]
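The value 0.906 follows directly from the ratio of conflict-free destination assignments to all assignments. A small sketch of the arithmetic is given here, assuming uniform and independent output selection as stated above; the function name is hypothetical.

```python
from fractions import Fraction
from math import factorial


def p_delay_full_load(n=4):
    """P(delay | m = n) for an n x n crossbar with uniform random output selection."""
    # n! of the n**n equally likely destination assignments are conflict free.
    p_no_conflict = Fraction(factorial(n), n ** n)
    return 1 - p_no_conflict


if __name__ == "__main__":
    p = p_delay_full_load(4)
    print(p, float(p))
```

For n = 4 the exact value is 29/32, which rounds to the 0.906 quoted in the text.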


Thus, 90.6 percent of the time when four messages try to pass through a
crossbar node simultaneously there will be conflict, assuming random node
output selection with uniform distribution. However, the permutations used
frequently in a particular SIMD calculation [Len78] and the topology of the
chosen interconnection network may result in node output needs that are far
from uniformly distributed for any given node. Thus, $P(\text{delay} \mid m=4)$
may not be useful for understanding network level, as opposed to switching
element level, behavior.


8.5  4X4  Composite  Node  Performance  Analysis 

The analysis of the composite node presented in this section is much less
straightforward than that of the crossbar node. Despite the fact that
performance is measured here in terms of time to pass information, the
composite node analysis is presented as a case by case analysis of the ways in
which conflict can occur in the node. The completeness and accuracy of the
analysis is more apparent with this approach. Performance values can then
be associated with each case for comparison of the composite node with
the crossbar node.
8.5.1  Problem  Overview  and  Notation

Table 8.1 lists all ways conflict can occur in the composite node. For each
way the performance analysis cases are given. The table includes new
terminology that is used in the composite node analysis. A delay occurs when
a message arrives at, or requests arrival at, an interchange box input which
already has one or more messages awaiting passage through the box. The new
message must wait in turn for passage and is therefore delayed. A subsequent
delay is simply a delay that occurs after an earlier delay or earlier conflict.
Figure 8.3(a) shows an example of a subsequent delay due, in this instance, to
an earlier conflict. In (1) messages A and B are shown in conflict. To resolve
the conflict, A is arbitrarily selected, so (2) shows B ready to transit the box
one time step later. The transit time for B is two time steps because of the
delay subsequent to the initial conflict.

A subsequent conflict is analogous to a subsequent delay. Figure 8.3(b)
shows a subsequent conflict. Initially, messages A and C conflict and message
B requires use of the same box input as A. The conflict is resolved by
arbitrarily choosing A, and (2) shows B and C at the box inputs after one time
step. Because B happens to need the same box output as C (and A), B and C
conflict. This conflict is subsequent to that of A and C.

An arbitration delay is a delay that occurs as a consequence only of the
way a conflict in an interchange box is arbitrated. Arbitrating, or resolving, a
conflict means selecting one of the conflicting messages to pass first. The other
message waits and, thus, may cause a delay for a third message. Changing the
choice for resolving the conflict would eliminate that delay. Figure 8.3(c)
shows an arbitration delay. In (1) messages A and C conflict, and B requires
use of the same box input as A. After one time step and the selection of C to


Table 8.1  A complete listing of the ways conflict can occur in the 4x4
composite node as a function of m, the number of messages
initially at the inputs of the node. The associated probability term
is given after each table entry corresponding to a case of the
performance analysis.

I.   m=1 (no conflict possible); $P(t=2 \mid m=1)$

II.  m=2

     A. No conflict in level 1

        1. No conflict in level 2; $P(t=2 \mid m=2)$

        2. Conflict in level 2; $P(t=3 \mid m=2)$

     B. Conflict in level 1 (cannot also have conflict in level 2); $P(t=3 \mid m=2)$

III. m=3

     A. No conflict in level 1

        1. No conflict in level 2; $P(t=2 \mid m=3)$

        2. Conflict in level 2; $P(t=3, \text{case 2} \mid m=3)$

     B. Conflict in level 1

        1. No conflict in level 2; $P(t=3, \text{case 1} \mid m=3)$

        2. Conflict in level 2

           a. One conflict only; $P(t=3, \text{case 3} \mid m=3)$
           b. One conflict and an arbitration delay; $P(t=4 \mid m=3)$
           c. One conflict and a subsequent conflict; $P(t=4 \mid m=3)$
Table 8.1, continued.

IV.  m=4

     A. No conflict in level 1

        1. No conflict in level 2; $P(t=2 \mid m=4)$

        2. Conflict in level 2; $P(t=3, \text{case 1} \mid m=4)$

     B. Conflict in level 1

        1. No conflict in level 2

           a. No conflict in level 2 given three messages enter level 2
              initially; $P(t=3, \text{case 2} \mid m=4)$

           b. No conflict in level 2 given two messages enter level 2 initially;
              $P(t=3, \text{case 4} \mid m=4)$ and $P(t=3, \text{case 5} \mid m=4)$

        2. Conflict in level 2 given three messages enter level 2 initially

           a. One conflict only; $P(t=3, \text{case 3} \mid m=4)$

           b. One conflict and an arbitration delay; $P(t=4, \text{case 1} \mid m=4)$

           c. One conflict and a subsequent conflict; $P(t=4, \text{case 1} \mid m=4)$

        3. Conflict in level 2 given two messages enter level 2 initially

           a. No conflict for first two messages entering level 2, but conflict
              for the second two messages; $P(t=4, \text{case 3} \mid m=4)$

           b. Conflict for the first two messages entering level 2 only;
              $P(t=4, \text{case 2} \mid m=4)$

           c. Conflict for the first two messages entering level 2, and a
              subsequent conflict; $P(t=4, \text{case 4} \mid m=4)$

           d. Conflict for the first two messages entering level 2, a
              subsequent conflict, and an arbitration delay; $P(t=5 \mid m=4)$

           e. Conflict for the first two messages entering level 2, a
              subsequent conflict, and another subsequent conflict;
              $P(t=5 \mid m=4)$
resolve the conflict, the messages are as in (2). Message B must wait one
more time step for A to clear the box before it can pass, even if B needs a
different box output than A, as shown in (3). Note that if A were chosen to
resolve the initial conflict, B and C would not subsequently conflict in this
example. Thus, regardless of the box output need of B, it is delayed as a
consequence of the resolution of the conflict between A and C. Note that an
arbitration delay is a special type of subsequent delay. A conflict is never the
result only of arbitration because the two messages involved in a conflict must
need the same interchange box output.
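The effect of the arbitration choice can be illustrated with a small simulation of a single interchange box. This is only a sketch under assumptions that are not taken from the original: each box input is modeled as a first-in first-out queue, message B is treated as already queued behind A, and conflicts are broken by a caller-supplied priority list. The function `simulate_box` and its parameters are hypothetical.

```python
from collections import deque


def simulate_box(queues, priority):
    """Simulate one 2x2 interchange box with FIFO input queues.

    queues   -- two lists of (name, wanted_output) pairs, one list per box input
    priority -- message names in the order used to break conflicts (assumed policy)
    Returns {name: time step in which that message passes through the box}.
    """
    pending = [deque(q) for q in queues]
    exit_step = {}
    step = 0
    while any(pending):
        step += 1
        heads = [q[0] for q in pending if q]
        # A conflict exists when both head messages need the same box output.
        if len(heads) == 2 and heads[0][1] == heads[1][1]:
            winner = min(heads, key=lambda msg: priority.index(msg[0]))
            heads = [winner]
        passed = set()
        for name, _ in heads:
            exit_step[name] = step
            passed.add(name)
        # Remove only the messages that passed in this time step.
        for q in pending:
            if q and q[0][0] in passed:
                q.popleft()
    return exit_step


if __name__ == "__main__":
    # A and C need output 0; B sits behind A and needs output 1.
    scenario = [[("A", 0), ("B", 1)], [("C", 0)]]
    print(simulate_box(scenario, ["C", "A", "B"]))  # C chosen first
    print(simulate_box(scenario, ["A", "C", "B"]))  # A chosen first
```

With C chosen first, B cannot pass until the third time step even though it needs a different output than A; with A chosen first, B passes in the second step. The extra step is the arbitration delay described above.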

Two examples follow to illustrate how the entries of Table 8.1 are
determined. First, consider the situation depicted by Figure 8.4, which
corresponds to table entry IV.B.2.b. Figure 8.4(a) shows the initial message
positions and the level 1 box connections chosen to satisfy the conditions of
this table entry. The upper box of level 1 has a conflict. This meets the first
condition (a conflict in level 1) of IV.B.2.b. Next, Figure 8.4(b) shows the
message locations in the node and updated box connection requests after the
first time step. Message A was chosen arbitrarily in the first time step to
resolve the conflict in the upper level 1 box. At this point a conflict exists
between A and C on level 2, satisfying the second condition of IV.B.2.b. The
connection request shown for message D is arbitrary; it is not affected by the
conditions defining the IV.B.2.b case. If during the second time step the
arbitration process selects C, then the resulting message configuration is as in
Figure 8.4(c). Message B is in a position such that it will be delayed by A,
fulfilling the last condition of IV.B.2.b. Figure 8.4(d) shows the message
positions after three time steps. All messages clear the node by the next time
step.
As another example, Figure 8.5 illustrates entry IV.B.2.c of Table 8.1.
The initial situation was arbitrarily chosen to be identical to that in
Figure 8.4(a). Without loss of generality, A is arbitrarily selected to resolve
the conflict in level 1, as in the previous example. If message A is selected to
resolve the first conflict in level 2, shown in Figure 8.5(b), the result is as in
Figure 8.5(c). The conditions of IV.B.2.c do not affect D. The second conflict
in level 2, which distinguishes IV.B.2.c from IV.B.2.b, is shown in Figure 8.5(c).
Figure 8.5(d) shows the result if C is selected to transit the box first.

If a single message is present at the inputs of the composite node it will
transit in two time steps. The time is two because there are two levels. Thus,

\[ P(t=2 \mid m=1) = 1. \]

Therefore,

\[ E(t \mid m=1) = 2. \]

The following facts and notation are established for convenience in the
analysis for m = 2, 3, and 4. There are $\binom{4}{2}$ ways two messages can be
arranged at the inputs of a 4x4 switching element, and two of these involve
having two messages at the inputs of a level 1 box for the composite switch.
Thus,

\[ P(\text{a level 1 box has 2 messages} \mid m=2) = \frac{2}{\binom{4}{2}} = \frac{1}{3}. \]

It is clear that

\[ P(\text{a level 1 box has 2 messages} \mid m=3) = 1 \]

and
Figure 8.5  Message movement in a composite node for Table 8.1 entry
IV.B.2.c. Messages are indicated by the letters A, B, C, and D.

(a) Initial message position. A and B conflict in level 1.

(b) Position after one time step. A and C conflict in level 2.

(c) Position after two time steps. B and C subsequently conflict
in level 2. (d) Position after three time steps.
\[ P(\text{a level 1 box has 2 messages} \mid m=4) = 1. \]
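These three occupancy probabilities can be confirmed by listing every placement of m messages on the four node inputs. The sketch below assumes a labeling that is not given in the original: node inputs 0 and 1 feed the upper level 1 box and inputs 2 and 3 feed the lower one. Any labeling that pairs the inputs two per box gives the same result.

```python
from fractions import Fraction
from itertools import combinations

# Assumed labeling: inputs 0, 1 -> upper level 1 box (0); inputs 2, 3 -> lower box (1).
LEVEL1_BOX = {0: 0, 1: 0, 2: 1, 3: 1}


def p_level1_box_has_two(m):
    """P(some level 1 box has 2 messages | m), all input placements equally likely."""
    placements = list(combinations(range(4), m))
    hits = 0
    for inputs in placements:
        boxes = [LEVEL1_BOX[i] for i in inputs]
        if any(boxes.count(b) == 2 for b in (0, 1)):
            hits += 1
    return Fraction(hits, len(placements))


if __name__ == "__main__":
    for m in (2, 3, 4):
        print(m, p_level1_box_has_two(m))
```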

Because two of four possible interchange box connections which may be needed
by the messages involve conflict, and since destinations are random and
uniformly distributed,

\[ P(\text{conflict in an interchange box}) = P(\text{no conflict in an interchange box}) = \frac{1}{2}. \]

The following notation is used in the ensuing equations, where i = 1 or 2:

1. $P(iU) = P(\text{no conflict in level } i \text{ upper box} \mid 2 \text{ messages at its inputs}) = \frac{1}{2}$,

2. $P(iL) = P(\text{no conflict in level } i \text{ lower box} \mid 2 \text{ messages at its inputs}) = \frac{1}{2}$,

3. $P(iX) = \frac{1}{2}$, where X = U or L; and

4. $P(i) = P(\text{no conflict in level } i \mid m=4) = P(iU) \cdot P(iL) = \frac{1}{4}$.
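The values 1/2 and 1/4 in this notation can be reproduced by enumerating the output requests of the messages in a box. The sketch assumes, as in the text, that each message requests either box output with equal probability and independently of the others; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def p_no_conflict_in_box():
    """P(no conflict | 2 messages at the inputs of one interchange box)."""
    # Each of the two messages requests box output 0 or 1.
    requests = list(product((0, 1), repeat=2))
    ok = sum(1 for a, b in requests if a != b)
    return Fraction(ok, len(requests))


def p_no_conflict_in_level():
    """P(no conflict in a level | m = 4), i.e. both boxes of the level hold 2 messages."""
    # Requests of the two upper-box messages followed by the two lower-box messages.
    requests = list(product((0, 1), repeat=4))
    ok = sum(1 for u0, u1, l0, l1 in requests if u0 != u1 and l0 != l1)
    return Fraction(ok, len(requests))


if __name__ == "__main__":
    print(p_no_conflict_in_box(), p_no_conflict_in_level())
```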

8.5.2  Passing  Two  Messages 

For m = 2, transit times of two or three are possible. A time of two occurs
only if there is no conflict. This corresponds to entry II.A.1 of Table 8.1.
There are three cases to consider. First, if no box receives two messages
simultaneously then conflict is impossible, so


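\[
\begin{aligned}
P(t=2, \text{case 1} \mid m=2) &= P(\text{no box in any level receives 2 messages} \mid m=2) \\
&= P(\text{no level 1 box has 2 messages} \mid m=2) \cdot {} \\
&\quad\; P(\text{no level 2 box has 2 messages} \mid m=2, \text{no level 1 box has 2 messages}) \\
&= \frac{2}{3} \cdot \frac{1}{2} = \frac{1}{3}.
\end{aligned}
\]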


Note that

\[ P(\text{no level 2 box has 2 messages} \mid m=2,\ \text{no level 1 box has 2 messages}) = \frac{1}{2} \]

because there are four ways two non-distinct messages can arrive at the inputs of level 2 boxes, and only two of the ways involve two messages at the inputs of a level 2 box. (See Figure 8.1 for aid in visualizing this.) Next, allow a level 1 box to receive two messages which do not conflict. There can be no conflict in level 2 in this case since no level 2 box will have two messages. So,

\[
\begin{aligned}
P(t=2,\ \text{case 2} \mid m=2) &= P(\text{a level 1 box has 2 messages but no conflict} \mid m=2) \\
&= P(\text{a level 1 box has 2 messages} \mid m=2) \cdot P(1X) \\
&= \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{6}.
\end{aligned}
\]

Finally, let a level 2 box receive two non-conflicting messages. Then,

\[
\begin{aligned}
P(t=2,\ \text{case 3} \mid m=2) &= P(\text{a level 2 box has 2 messages but no conflict} \mid m=2) \\
&= P(\text{no level 1 box has 2 messages} \mid m=2) \cdot P(2X) \\
&\quad \cdot P(\text{a level 2 box has 2 messages} \mid m=2,\ \text{no level 1 box has 2 messages}) \\
&= \frac{2}{3} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{6}.
\end{aligned}
\]

Note that

\[
\begin{aligned}
&P(\text{a level 2 box has 2 messages} \mid m=2,\ \text{no level 1 box has 2 messages}) \\
&\quad = 1 - P(\text{no level 2 box has 2 messages} \mid m=2,\ \text{no level 1 box has 2 messages}).
\end{aligned}
\]

With m = 2 it is impossible to have both a level 1 and a level 2 box receive two messages simultaneously. Thus, these three cases cover all possible ways to have no conflict with m = 2. Summing the probabilities of the cases yields the desired result. Thus,

\[ P(t=2 \mid m=2) = \frac{1}{3} + \frac{1}{6} + \frac{1}{6} = \frac{2}{3}. \]


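A transit time of three (entries II.A.2 and II.B of Table 8.1) occurs if there is conflict. With m = 2 conflict can occur in either level 1 or level 2, but not both. Thus,

\[
\begin{aligned}
P(t=3 \mid m=2) &= P(\text{conflict} \mid m=2) \\
&= P(\text{conflict in level 1} \mid m=2) + P(\text{conflict in level 2} \mid m=2) \\
&= [1-P(1X)] \cdot P(\text{a level 1 box has 2 messages} \mid m=2) \\
&\quad + [1-P(2X)] \cdot P(\text{a level 2 box has 2 messages} \mid m=2,\ \text{no conflict in level 1}) \\
&= \frac{1}{2} \cdot \frac{1}{3} + \frac{1}{2} \cdot \frac{1}{3} = \frac{1}{3}.
\end{aligned}
\]

Therefore,

\[ E(t \mid m=2) = \sum_{i=2}^{3} i \cdot P(t=i \mid m=2) = 2\tfrac{1}{3}, \]

and

\[ \sum_{i=2}^{3} P(t=i \mid m=2) = \frac{2}{3} + \frac{1}{3} = 1, \]

as required.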
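The m = 2 bookkeeping above can be checked with exact rational arithmetic. The following is a minimal sketch, not part of the original analysis; the variable names are illustrative, and the probabilities are the ones quoted in this section (1/2 for a conflict-free box, 1/3 for both messages entering one level 1 box, 1/2 for both reaching one level 2 box).

```python
from fractions import Fraction

half = Fraction(1, 2)            # P(1X) = P(2X): no conflict in a box holding 2 messages
same_l1 = Fraction(1, 3)         # P(a level 1 box has 2 messages | m=2)
same_l2 = Fraction(1, 2)         # P(a level 2 box has 2 messages | no level 1 box has 2)

# Three conflict-free cases (transit time 2).
case1 = (1 - same_l1) * (1 - same_l2)   # no box ever holds two messages
case2 = same_l1 * half                  # shared level 1 box, no conflict
case3 = (1 - same_l1) * half * same_l2  # shared level 2 box, no conflict
p_t2 = case1 + case2 + case3

# Conflict in level 1 or in level 2 (transit time 3).
p_t3 = same_l1 * (1 - half) + (1 - same_l1) * same_l2 * (1 - half)

expected = 2 * p_t2 + 3 * p_t3
print(p_t2, p_t3, expected)  # 2/3 1/3 7/3
```

The two probabilities sum to one and the expectation is 7/3, in agreement with the values derived above.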

8.5.3  Passing  Three  Messages 

For m = 3, transit times ranging from two to four are possible. Clearly, to achieve a time of two no conflict may occur (Table 8.1 entry III.A.1). So,

\[
\begin{aligned}
P(t=2 \mid m=3) &= P(\text{no conflict} \mid m=3) \\
&= P(1X) \cdot P(\text{a level 1 box has 2 messages} \mid m=3) \cdot P(2X) \\
&\quad \cdot P(\text{a level 2 box has 2 messages} \mid m=3,\ \text{no conflict in level 1}) \\
&= \frac{1}{2} \cdot 1 \cdot \frac{1}{2} \cdot 1 = \frac{1}{4}.
\end{aligned}
\]

Note that

\[ P(\text{a level 2 box has 2 messages} \mid m=3,\ \text{no conflict in level 1}) = 1 \]

because three messages must arrive simultaneously at level 2, given m = 3 and no conflict occurred in level 1.

A time of three can occur in three ways for m = 3. Assume first that conflict occurs only in level 1 (Table 8.1 entry III.B.1). Because first two messages reach level 2 simultaneously and then a single message,

\[
\begin{aligned}
P(t=3,\ \text{case 1} \mid m=3) &= P(\text{conflict in level 1 only} \mid m=3) \\
&= P(\text{conflict in level 1} \mid m=3) \cdot P(\text{no conflict in level 2} \mid m=3,\ \text{conflict in level 1}) \\
&= \frac{1}{2} \cdot \frac{3}{4} = \frac{3}{8}.
\end{aligned}
\]

The term P(no conflict in level 2 | m=3, conflict in level 1) in the above equation must be used instead of P(no conflict in level 2 | m=3) because the level 1 outputs used by two messages arriving at level 2 have an effect on the probability of conflict in level 2. For example, if the two messages come from the same level 1 box (implies no conflict in level 1) then conflict in level 2 is impossible since the messages must use different level 2 boxes. For this case of the analysis conflict occurs in level 1, so the two messages first arriving in level 2 must come from distinct level 1 boxes. The probability these messages do conflict is

\[ P(\text{they arrive at the same level 2 box}) \cdot P(2X) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}, \]

so

\[ P(\text{no conflict in level 2} \mid m=3,\ \text{level 1 conflict}) = 1 - \frac{1}{4} = \frac{3}{4}. \]

For conflict in level 2 only (Table 8.1 entry III.A.2),

\[
\begin{aligned}
P(t=3,\ \text{case 2} \mid m=3) &= P(\text{no conflict in level 1} \mid m=3) \cdot P(\text{conflict in level 2} \mid m=3) \\
&= \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}.
\end{aligned}
\]

Finally, a time of three occurs if there is a single conflict in level 1 and a single conflict in level 2 without any arbitration delay or subsequent conflicts (Table 8.1 entry III.B.2.a). Thus,

\[
\begin{aligned}
P(t=3,\ \text{case 3} \mid m=3) &= P(\text{conflict in level 1} \mid m=3) \\
&\quad \cdot P(\text{conflict in level 2} \mid m=3,\ \text{conflict in level 1}) \cdot \frac{1}{2} \cdot P(2X) \\
&= \frac{1}{2} \cdot \frac{1}{4} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{32}.
\end{aligned}
\]

The first two factors of this equation represent the probability of having both a conflict in level 1 and a conflict in level 2 with m = 3. The 1/2 factor is the probability that the conflict in level 2 is resolved without resulting arbitration delay. The P(2X) factor is the probability the last two messages to exit the node do not conflict in level 2. Combining the above yields

\[ P(t=3 \mid m=3) = \sum_{i=1}^{3} P(t=3,\ \text{case } i \mid m=3) = \frac{3}{8} + \frac{1}{4} + \frac{1}{32} = \frac{21}{32}. \]


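A time of four occurs when there is conflict at both levels with an arbitration delay or subsequent conflict (Table 8.1 entries III.B.2.b and III.B.2.c). Thus,

\[
\begin{aligned}
P(t=4 \mid m=3) &= P(\text{conflict in levels 1 and 2} \mid m=3) \\
&= P(\text{conflict in level 1} \mid m=3) \cdot P(\text{conflict in level 2} \mid m=3,\ \text{conflict in level 1}) \\
&\quad \cdot \left[ \frac{1}{2} + \frac{1}{2} \cdot P(2X) \right] \\
&= \frac{1}{2} \cdot \frac{1}{4} \cdot \left[ \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{2} \right] = \frac{3}{32},
\end{aligned}
\]

where

\[ P(\text{conflict in level 2} \mid m=3,\ \text{conflict in level 1}) = 1 - P(\text{no conflict in level 2} \mid m=3,\ \text{conflict in level 1}) = 1 - \frac{3}{4}, \]

as appears in the equation for P(t=3, case 1 | m=3).

In conclusion,

\[ \sum_{i=2}^{4} P(t=i \mid m=3) = \frac{1}{4} + \frac{21}{32} + \frac{3}{32} = 1, \]

and

\[ E(t \mid m=3) = \sum_{i=2}^{4} i \cdot P(t=i \mid m=3) = 2\tfrac{27}{32}. \]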
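As a cross-check on the three-message case, the transit time distribution can be assembled from the factors named in this section. This is only a sketch with illustrative variable names; it takes the conditional probabilities (1/2 for a level 1 conflict, 1/4 for a level 2 conflict given a level 1 conflict) as given rather than deriving them.

```python
from fractions import Fraction

half = Fraction(1, 2)            # P(1X) = P(2X)
p_c1 = half                      # P(conflict in level 1 | m=3)
p_c2_given_c1 = Fraction(1, 4)   # P(conflict in level 2 | m=3, conflict in level 1)

dist = {
    # Time 2: no conflict at either level.
    2: half * half,
    # Time 3: level 1 only, level 2 only, or one conflict per level
    # with no arbitration delay and no subsequent conflict.
    3: (p_c1 * (1 - p_c2_given_c1)
        + (1 - p_c1) * half
        + p_c1 * p_c2_given_c1 * half * half),
    # Time 4: conflict at both levels with arbitration delay (1/2),
    # or no delay (1/2) followed by a further conflict (1/2).
    4: p_c1 * p_c2_given_c1 * (half + half * half),
}

total = sum(dist.values())
expected = sum(t * p for t, p in dist.items())
print(dist[3], dist[4], total, expected)  # 21/32 3/32 1 91/32
```

The expectation 91/32 equals 2 27/32, or about 2.84 time steps.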

8.5.4  Passing Four Messages

Now  consider  the  probabilities  of  different  times  to  pass  four  messages 
through  the  composite  node.  Only  two  time  steps  are  needed  when  there  is  no 
conflict  (Table  8.1  entry  IV.A.l).  So, 

\[ P(t=2 \mid m=4) = P(1U) \cdot P(1L) \cdot P(2U) \cdot P(2L) = \frac{1}{16}. \]

There are five cases to consider for a time of three. First assume no conflicts occur in level 1 (Table 8.1 entry IV.A.2). Then one or two conflicts must occur in level 2. Thus,

\[ P(t=3,\ \text{case 1} \mid m=4) = P(1) \cdot (1-P(2)) = \frac{3}{16}. \]

Next, assume exactly one level 1 interchange box has a conflict and there is no subsequent conflict in level 2 (Table 8.1 entry IV.B.1.a). Hence,

\[ P(t=3,\ \text{case 2} \mid m=4) = [(1-P(1U)) \cdot P(1L) + P(1U) \cdot (1-P(1L))] \cdot P(2X) = \frac{1}{4}. \]

In  the  above  expression,  P(2X)  is  used  because  one  of  the  level  2  boxes  must 
process  three  messages  (one  arriving  later)  while  the  other  processes  just  one. 
The  former  must  be  conflict  free,  and  the  latter  has  no  constraint. 

Next, consider conflict at both levels (Table 8.1 entry IV.B.2.a). Assume the four messages are A, B, C, and D, and there is one conflict in level 1, for example, between A and B, so all but message B go to level 2 after one time step. Also assume there is one conflict in level 2, between A and C (or D). First, depending upon which message, A or C, is chosen by the arbitration logic, B may find it must wait for A even if its output request differs from that of A, or for C if B and C do request the same output. Second, if A is chosen to transit level 2 before C, and B and C request different output lines, then B will not be delayed at level 2. The former situation requires four time steps, and the latter (Table 8.1 entry IV.B.2.a) requires three. Figure 8.6 illustrates one instance of the latter situation. The probability of the latter is

\[
\begin{aligned}
P(t=3,\ \text{case 3} \mid m=4) &= [(1-P(1U)) \cdot P(1L) + P(1U) \cdot (1-P(1L))] \\
&\quad \cdot (1-P(2X)) \cdot \frac{1}{2} \cdot P(2X) = \frac{1}{16}.
\end{aligned}
\]


The first factor is the probability that exactly one level 1 box has a conflict. The next factor is the probability that the first message from the level 1 box which had a conflict, call this message M, also has a conflict in level 2. The 1/2 is the probability that M will be chosen to transit level 2 first. The last factor is the probability that the two delayed messages (B and C in the example) do not conflict.

Now assume conflict in both level 1 boxes and that both level 2 boxes receive messages (this happens half the time there are two conflicts in level 1). Some of the instances of Table 8.1 entry IV.B.1.b fit this assumption. There is no possibility of conflict in level 2 boxes, so

\[ P(t=3,\ \text{case 4} \mid m=4) = \frac{1}{2} \cdot (1-P(1U)) \cdot (1-P(1L)) = \frac{1}{8}. \]

Figure 8.6  Message movement in a composite node corresponding to one element in the set of message flow patterns covered by P(t=3, case 3 | m=4). (a) Initial message position. A and B conflict in level 1. (b) Position after one time step. A and C conflict in level 2. (c) Position after two time steps. No remaining conflicts.

Finally, assume conflict in both level 1 boxes but only one level 2 box receives messages and there is no conflict for either message pair that transits that box. This describes the rest of the instances of Table 8.1 entry IV.B.1.b. Then

\[ P(t=3,\ \text{case 5} \mid m=4) = \frac{1}{2} \cdot (1-P(1U)) \cdot (1-P(1L)) \cdot P(2X) \cdot P(2X) = \frac{1}{32}. \]

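The probability that all messages transit the composite node in three time steps is

\[
\begin{aligned}
P(t=3 \mid m=4) &= \sum_{i=1}^{5} P(t=3,\ \text{case } i \mid m=4) \\
&= \frac{3}{16} + \frac{1}{4} + \frac{1}{16} + \frac{1}{8} + \frac{1}{32} = \frac{21}{32}.
\end{aligned}
\]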

For a time of four, there are four cases. One involves the situation of the third case for a time of three. There are two ways to obtain a time of four in that situation: (1) message B enters a non-empty queue in level 2 (Table 8.1 entry IV.B.2.b (see Figure 8.4)) and (2) B enters an empty queue but conflicts with the remaining message (C or D) (Table 8.1 entry IV.B.2.c (see Figure 8.5)). Thus,

\[
\begin{aligned}
P(t=4,\ \text{case 1} \mid m=4) &= [(1-P(1U)) \cdot P(1L) + P(1U) \cdot (1-P(1L))] \\
&\quad \cdot \left[ \frac{1}{2} \cdot (1-P(2X)) + \frac{1}{2} \cdot (1-P(2X)) \cdot (1-P(2X)) \right] \\
&= \frac{3}{16}.
\end{aligned}
\]

The first factor is the probability of conflict in exactly one level 1 box. The first term of the second factor is the probability B enters a non-empty queue. This term has two factors: (1-P(2X)) is the probability of the conflict in level 2 necessary to force a message to remain at the input of the level 2 box B will use until B arrives, and 1/2 is the probability that the message is at the level 2 box input B will use. The second term of the second factor is the probability that B enters an empty level 2 box input but conflicts with the other message at that box. Again, (1-P(2X)) is the probability of the first conflict in level 2. The 1/2 is the probability B arrives at the empty level 2 box input. The second (1-P(2X)) is the probability B conflicts with the remaining message (C or D).

Now assume conflict in both level 1 boxes and that only one level 2 box receives messages (this happens half the time there are two conflicts in level 1). There are then three ways (cases 2, 3, and 4) a time of four can occur. In case 2, the first two messages reaching the box in level 2 conflict, but there are no subsequent conflicts (Table 8.1 entry IV.B.3.b), so

\[ P(t=4,\ \text{case 2} \mid m=4) = \frac{1}{2} \cdot (1-P(1U)) \cdot (1-P(1L)) \cdot (1-P(2X)) \cdot P(2X) = \frac{1}{32}. \]

The 1/2 factor is the probability that only one level 2 box receives messages given there are conflicts in both level 1 boxes, denoted by the factors (1-P(1U)) and (1-P(1L)). The probability that the first two messages to reach level 2 conflict, but there are no subsequent conflicts, is (1-P(2X)) * P(2X).

In case 3, the first pair of messages arriving at level 2 do not conflict, but the second pair do (Table 8.1 entry IV.B.3.a), yielding

\[ P(t=4,\ \text{case 3} \mid m=4) = \frac{1}{2} \cdot (1-P(1U)) \cdot (1-P(1L)) \cdot P(2X) \cdot (1-P(2X)) = \frac{1}{32}. \]

This equation is derived in a similar way to that for P(t=4, case 2 | m=4).

Figure 8.7 is provided as an aid to understanding the message movement corresponding to case 4. In this case, the first pair of messages arriving at level 2, A and C, conflict. A, for example, transits and C remains while B and D arrive on distinct inputs. D is behind C in the queue. Assume there is a second conflict between B and C. Depending upon which (B or C) is chosen by the priority logic (either is equally likely), either (1) B and D are at the front of the queues or (2) C and D are in the same queue. To obtain a time of four, B and D must be at the front of the queues and not conflict (Table 8.1 entry IV.B.3.c), otherwise a time of five results. Hence,

\[
\begin{aligned}
P(t=4,\ \text{case 4} \mid m=4) &= \frac{1}{2} \cdot (1-P(1U)) \cdot (1-P(1L)) \cdot (1-P(2X)) \\
&\quad \cdot (1-P(2X)) \cdot \frac{1}{2} \cdot P(2X) = \frac{1}{128}.
\end{aligned}
\]

The first four factors of the equation appear for the same reason as the first four factors in the expression for P(t=4, case 2 | m=4). The fifth factor, (1-P(2X)), is the probability that B and C conflict as assumed. The 1/2 is the probability that C is chosen to resolve the conflict so that B and D are ready to transit level 2 on the next time step. They will do so if they do not conflict. This has a probability of P(2X), the last factor. The probability of a time of four is

\[
\begin{aligned}
P(t=4 \mid m=4) &= \sum_{i=1}^{4} P(t=4,\ \text{case } i \mid m=4) \\
&= \frac{3}{16} + \frac{1}{32} + \frac{1}{32} + \frac{1}{128} = \frac{33}{128}.
\end{aligned}
\]

Figure 8.7  Message movement in a composite node corresponding to one element in the set of message flow patterns covered by P(t=4, case 4 | m=4). (a) Initial message position. (b) Position after one time step. (c) Position after two time steps. (d) Position after three time steps.

Finally, there is only one way a time of five can occur and that begins with the situation of case four above, where a time of four or five can occur. When B and D are at the front of the queues and conflict, or C and D are in the same queue (Table 8.1 entries IV.B.3.d and IV.B.3.e), t = 5. So,

\[
\begin{aligned}
P(t=5 \mid m=4) &= \frac{1}{2} \cdot (1-P(1U)) \cdot (1-P(1L)) \cdot (1-P(2X)) \\
&\quad \cdot \left[ (1-P(2X)) \cdot \frac{1}{2} \cdot (1-P(2X)) + (1-P(2X)) \cdot \frac{1}{2} \right] \\
&= \frac{3}{128}.
\end{aligned}
\]

The first four factors of this equation appear for the same reason as the first four factors in the expressions for P(t=4, case 2 | m=4) and P(t=4, case 4 | m=4). The last factor consists of two terms. The first is the probability B and C conflict and C was chosen to resolve the conflict, setting the stage for B and D to use a level 2 box simultaneously and conflict. The second term is the probability B and C conflict, but C was not chosen, forcing D to follow C through the same level 2 box input.


Therefore,

\[ E(t \mid m=4) = \sum_{i=2}^{5} i \cdot P(t=i \mid m=4) = 3\tfrac{31}{128}, \]

and

\[ \sum_{i=2}^{5} P(t=i \mid m=4) = \frac{1}{16} + \frac{21}{32} + \frac{33}{128} + \frac{3}{128} = 1. \]
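The four-message case has enough terms that a mechanical tally is useful. The sketch below (variable names are illustrative, not from the original) multiplies out each case exactly as described in this section, using 1/2 for every P(1U), P(1L), and P(2X).

```python
from fractions import Fraction

h = Fraction(1, 2)                    # P(1U) = P(1L) = P(2X)
one_l1 = (1 - h) * h + h * (1 - h)    # exactly one level 1 box conflicts
both_l1 = (1 - h) * (1 - h)           # both level 1 boxes conflict

t3_cases = [
    (h * h) * (1 - h * h),            # case 1: P(1) * (1 - P(2))
    one_l1 * h,                       # case 2
    one_l1 * (1 - h) * h * h,         # case 3
    h * both_l1,                      # case 4
    h * both_l1 * h * h,              # case 5
]
t4_cases = [
    one_l1 * (h * (1 - h) + h * (1 - h) * (1 - h)),   # case 1
    h * both_l1 * (1 - h) * h,                         # case 2
    h * both_l1 * h * (1 - h),                         # case 3
    h * both_l1 * (1 - h) * (1 - h) * h * h,           # case 4
]
t5 = h * both_l1 * (1 - h) * ((1 - h) * h * (1 - h) + (1 - h) * h)

dist = {2: h ** 4, 3: sum(t3_cases), 4: sum(t4_cases), 5: t5}
total = sum(dist.values())
expected = sum(t * p for t, p in dist.items())
print(dist[3], dist[4], dist[5], total, expected)  # 21/32 33/128 3/128 1 415/128
```

The expectation 415/128 equals 3 31/128, or about 3.24 time steps.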


8.5.5  Permutation Delay Probability

As in Section 8.4, the preceding analysis is most applicable to MIMD environments. An analysis for the composite node in SIMD mode, analogous to that given for the crossbar node, can be performed. The composite node can perform 2^4 = 16 permutations. Thus,

\[ P(\text{delay} \mid m=4) = 1 - P(\text{no conflict} \mid m=4) = 1 - \frac{2^4}{4^4} = 0.938. \]

Thus, 93.8 percent of the time four messages try to pass through a composite node simultaneously there will be conflict, assuming random node output selection with uniform distribution. Again, P(delay | m=4) is likely to be of limited use for understanding network level, as opposed to switching element level, behavior.
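The count of 16 conflict-free request patterns out of 4^4 can be reproduced by brute force. The sketch below assumes a simplified wiring that is not taken from the figures: inputs 0 and 1 share the upper level 1 box, inputs 2 and 3 share the lower one, each level 1 box has one link to each level 2 box, and the level 2 boxes serve outputs (0, 1) and (2, 3). The function name is hypothetical. A different input pairing would relabel the patterns but not change the count.

```python
from itertools import product

def conflict_free(dests):
    """dests[i] is the output (0..3) requested by the message at input i."""
    # Level 1: two messages in one box conflict if both need the same
    # level 2 box, since the box has only one link to each level 2 box.
    for a, b in ((0, 1), (2, 3)):
        if dests[a] // 2 == dests[b] // 2:
            return False
    # Level 2: each box now holds exactly two messages, which conflict
    # only if they request the same output line.
    return len(set(dests)) == 4

patterns = list(product(range(4), repeat=4))     # 4^4 = 256 request patterns
free = sum(conflict_free(d) for d in patterns)
p_delay = 1 - free / len(patterns)
print(free, p_delay)  # 16 0.9375
```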
8.5.6  Summary

The results of the analysis for the two nodes are summarized in Table 8.2. An obvious difference between the two implementations is that twice as much time is required to transit the composite node as the crossbar node when there is no contention. A more subtle difference in their blocking can be observed by comparing the incremental difference in expected delay that each exhibits as m is varied. This is the discrete derivative of delay and gives an indication of the degree of conflict as a function of m.

Table 8.2  Summary of transit time performance and incremental delay characteristics for the crossbar and composite nodes.

                      Crossbar                     Composite
    Number of         Expected     Incremental     Expected     Incremental
    Messages (m)      Transit      Difference      Transit      Difference
                      Time E(t|m)                  Time E(t|m)

    1                 1            -               2            -
    2                 1.25         0.25            2.33         0.33
    3                 1.69         0.44            2.84         0.51
    4                 2.13         0.44            3.24         0.40

    Total Difference               1.13                         1.24

In the crossbar node the incremental difference in delay is a nondecreasing sequence with increasing m. For the composite node there is a noticeable "hump." Increasing m from two to three produces a greater increase in E(t|m) than an increase from three to four. This can be explained in the following way. When m = 2 there is a two-thirds probability that the two messages enter different level 1 interchange boxes, since there are $\binom{4}{2} = 6$ ways two messages can occupy the four level 1 inputs and four of these ways do not involve two messages at a single level 1 box. This reduces the chance for conflict in level 1 of the composite node. However, when m = 3 one level 1 box must have two messages to process, increasing the chance for conflict in the node significantly. A smaller increase occurs for m = 4, because it is then possible for two conflicts to occur in the same level.
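The two-thirds figure and the shape of the "hump" can both be checked in a few lines. This is a sketch under two stated assumptions: level 1 inputs 0 and 1 belong to one box and inputs 2 and 3 to the other, and the composite expectations are the exact fractions derived in Section 8.5 (2, 7/3, 91/32, 415/128). The names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Ways two messages can occupy the four level 1 inputs.
placements = list(combinations(range(4), 2))
separate = [p for p in placements if p[0] // 2 != p[1] // 2]
p_separate = Fraction(len(separate), len(placements))

# Expected composite transit times E(t|m) for m = 1..4.
expected = {1: Fraction(2), 2: Fraction(7, 3),
            3: Fraction(91, 32), 4: Fraction(415, 128)}
increments = {m: float(expected[m] - expected[m - 1]) for m in (2, 3, 4)}

print(len(placements), p_separate)  # 6 2/3
print({m: round(d, 2) for m, d in increments.items()})
```

The increment from m = 2 to m = 3 is the largest of the three, which is the hump described above.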

Node level behavior can provide insight into overall network behavior. In a multistage cube-type network, the average delay should increase more rapidly as a message loading threshold is reached and then taper off as full loading is reached. On the other hand, a complete crossbar network of the same size should exhibit a nondecreasing sequence of incremental increases in delay. These predictions should of course be verified by simulation.

Finally it can be observed that the total increase in delay is more for the composite node. This difference is due to the fact that the composite node is a blocking network, while the crossbar is not. The composite node introduces conflict above and beyond that due to coincident message destinations.

8.6  Network Implementation and Performance


An ESC network can be constructed from 4x4 switching nodes using the following method. Needed are (n-1)/2 stages of N/4 4x4 switching nodes each, for n odd. The input and output stages must be implemented with interchange boxes as usual. In the ESC topology, at stage i, 0 ≤ i < n, the two inputs to an interchange box have addresses that differ only in the i-th bit position. Let n be odd (a way of using 4x4 switching nodes with n even will be described later). Either append 0 to the right of the least significant bit of the binary representation of all labels, or append 1. This allows the appropriate bits of the binary representation to be paired during conversion to base four notation. Convert all link labels to base four representation. For purposes of the construction method description, let the (n-1)/2 stages of N/4 4x4 switches be numbered (n-1)/2 - 1, ..., 1, 0 (from input to output). At stage i, 0 ≤ i < (n-1)/2, the four inputs to a 4x4 switching node must have labels that differ only in the i-th position of their base four representation. The link with a 0 in the i-th position of its address connects to the top input of the 4x4 node, 2 to the next input, 1 to the next input, and finally 3 to the bottom input [Smi81]. The outputs of the 4x4 switching nodes have the same labels as the input lines, but in increasing order, i.e., the top output label has a 0 in the i-th position, next 1, next 2, and the bottom 3. Figure 8.8 shows a composite node labeled as described. The bits of the corresponding binary labeling for the constituent boxes are shown internal to the node. When composite nodes are used, making connections in the above manner creates an ESC network. When crossbar nodes are used, a network is created whose capabilities are a superset of those of the ESC network. Figure 8.9 shows how 4x4 switches can be used in the construction of an ESC, or ESC-like (if crossbar switches are used), network with N = 8.

The ESC network provides for individual box control. Because there are N(n+1)/2 interchange boxes in an N-input network, but stage n is normally bypassed, there are 2^(Nn/2) distinct network settings. Each of these settings corresponds to a unique permutation of network inputs to outputs. Since an ESC network built from composite nodes results in a network identical to one built from interchange boxes, the number of permutations performable using either implementation is the same.

If crossbar nodes are used, the number of performable permutations is greater than with the composite nodes. This is because each crossbar node has 4! = 24 settings which are permutations. As with the interchange box implementation, the network performs a permutation of inputs to outputs if and only if each crossbar node is set to a one-to-one connection of inputs to outputs. A network with N such that n is odd can be constructed from 4x4 crossbar nodes, except for the input and output stages. Thus, ((n-1)/2)(N/4) 4x4 crossbar nodes are used. Such a network, with stage n bypassed, can perform 2^(N/2) * 24^(N(n-1)/8) permutations (a factor of 2 for each of the N/2 output stage interchange boxes and a factor of 24 for each crossbar node). These permutations are unique since there is only one path from a given input to a given output. If one switching node setting is changed, there is no other switching node whose setting can be changed to yield the original permutation. Table 8.3 compares the number of permutations performable with composite and crossbar nodes.

Table 8.3  Comparison of permuting capabilities of ESC networks implemented with the maximum number of composite and crossbar nodes (stage n bypassed) for various values of N.

                 Number of Permutations
       N         Composite          Crossbar

       4         16                 16
       8         4096               9216
      16         4.29x10^9          2.17x10^10
      32         1.21x10^24         7.94x10^26
      64         6.28x10^57         2.71x10^63
     128         7.27x10^134        5.84x10^151
     256         1.80x10^308        1.16x10^342
     512         3.74x10^693        5.39x10^783
    1024         1.88x10^1541       3.90x10^1721

When N is such that n is even, then one way the network can be implemented is to use 4x4 crossbar or composite nodes for all but three stages. Any one stage i in the original ESC network, where i is odd, can be constructed with interchange boxes or with 4x4 crossbar nodes which are limited to performing only a single cube function. The link labels should be converted to base four to specify the 4x4 node connections after a 0 (or 1) has been inserted immediately to the right of bit position i in the binary representation of the link labels. This allows the appropriate bits to be paired during conversion to base four notation. Again for construction purposes, the stages of 4x4 switches can this time be numbered n/2 - 1, ..., 1, 0. Stage [...] in this new numbering scheme is the stage implemented with interchange boxes or limited 4x4 crossbar nodes. For n even and with stage n bypassed, the network can perform 2^N * 24^(N(n-2)/8) unique permutations if crossbar nodes are used, and 2^(Nn/2) with composite nodes, as before.
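The permutation counts of Table 8.3 can be regenerated from the counting argument of this section. The sketch below is an assumption-laden reconstruction: it takes one factor of 2 per interchange box, one factor of 24 per 4x4 crossbar node, stage n bypassed, one interchange box stage for n odd and two for n even. The function names are hypothetical.

```python
def composite_perms(N):
    """Settings of an ESC built from interchange boxes (stage n bypassed)."""
    n = N.bit_length() - 1            # N = 2**n
    return 2 ** (N * n // 2)

def crossbar_perms(N):
    """Settings when the maximum number of 4x4 crossbar nodes is used."""
    n = N.bit_length() - 1
    if n % 2:
        # n odd: (n-1)/2 stages of N/4 crossbars plus one stage of N/2 boxes.
        return 2 ** (N // 2) * 24 ** ((N // 4) * (n - 1) // 2)
    # n even: (n-2)/2 crossbar stages plus two stages of N/2 boxes each.
    return 2 ** N * 24 ** ((N // 4) * (n - 2) // 2)

for N in (4, 8, 16, 32):
    print(N, f"{composite_perms(N):.3g}", f"{crossbar_perms(N):.3g}")
```

For N = 8 this gives 4096 and 9216, matching the corresponding table row.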


8.7  Conclusions

The crossbar node is always faster at passing messages than the composite node. If the connection requests do not conflict in the composite node, the crossbar is twice as fast. When the connection requests of the messages at a given switching element form a permutation that a composite node cannot pass without conflict, but that a crossbar can, it takes three times longer for all messages to exit the composite node. Finally, as the number of messages increases the degree of conflict in the composite node rises more rapidly than in the crossbar node.

The performance of basic building blocks for cube-type multistage interconnection networks has been examined. The 4-input/4-output crossbar node has been compared to a 4-input/4-output composite node constructed from 2-input/2-output interchange boxes. Capabilities of these two switching element design approaches were quantified. Implementation of an ESC, or ESC-like, network using 4-input/4-output switches was discussed. The results presented give designers of parallel computer systems additional information to apply to interconnection network design.

CHAPTER 9


STUDY OF IMAGE CONTOUR EXTRACTION
WITH NETWORK AND SYSTEM IMPLICATIONS

9.1  Introduction

Digital image processing has long been recognized as well-suited for parallel processing. Many individual image processing algorithms and their formulations as parallel algorithms have been studied, including, for example, image coding [MDS82], correlation [Ack77, SSF82], histogramming [SSK81], line segment generation [Sta74], resampling [WaS82], segmentation [Dou82], and two-dimensional FFT [MSS80b]. Each part of this body of work illuminates some aspect of image processing with a parallel processing system.

The whole can sometimes be greater than the sum of its parts. Similarly, study of an image processing scenario, a task consisting of multiple, basic image processing algorithms such as those above, can yield insight not forthcoming from consideration of the elements individually. A succession of processing steps is more characteristic of actual machine use than the load associated with any single algorithm. The interaction between image processing steps may well influence the scenario-level processing method of choice.


Portions of this chapter were developed with David L. Tuomenoksa, James T. Kuehn, O. Robert Mitchell, and Kirk Dunkelberger.


As stated in Chapter 2, Section 2.2, PASM is intended to be suited for several applications, including image processing, and will be designed to meet that goal. The work in this chapter contributes to that effort. The nature of the anticipated task load is a significant consideration in the selection of an interconnection network for a parallel processor as well. The capabilities of the ESC and its place among other fault-tolerant multistage interconnection networks have been established. The next logical step in a study of the ESC is to ascertain its utility for performing the communication role in a multiprocessor computer system, in particular, PASM.

Contour extraction is the focus of this chapter; it was selected for two reasons. First, contour extraction is a key image processing step in applications ranging from computer-assisted cartography to industrial inspection [ChH82, Jar80]. Digital image processing will continue to increase in importance for these applications. Second, contour extraction presents a multifaceted challenge to a parallel computer; hence, it illuminates a number of issues in parallel processor design [TAS83].

9.2  Contour Extraction

Contour extraction consists of two operations: segmentation and contour tracing. Segmentation divides an image into various regions; the goal is for these regions to correspond closely with the depicted background and with objects of interest. Contour tracing identifies the boundaries of regions. Object boundaries are the items of interest.

There are many ways to accomplish segmentation [RiA77]. One uses thresholding, a technique by which an image is mapped onto two brightness, or gray, levels. Typically, all portions of an image at or below a brightness threshold, or level, are set to black, and all above are set to white. Determining a single threshold that yields a useful segmentation for a given image can be difficult; this is particularly true when the average brightness level varies significantly among various areas in the image. It may be that no satisfactory single threshold exists for a given image. Thresholding can be generalized to using any well defined image feature, and not just pixel gray levels.

If  an  image  is  divided  into  sufficiently  small  subimages,  the  lighting  level 
will  be  approximately  uniform  within  a  given  subimage  despite  significant 
image-wide  variation.  Thresholding  independently  within  each  subimage  can 
succeed  in  capturing  objects  of  interest  throughout  the  image  when  a  single, 
global  threshold  will  fail. 
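
The contrast between a single global threshold and independent per-subimage thresholds can be illustrated with a small Python sketch. The function names, the tile size, and the sample gray values are illustrative assumptions and are not part of the method described here.

```python
def threshold(image, level):
    """Set pixels at or below `level` to 0 (black) and all others to 1 (white)."""
    return [[0 if p <= level else 1 for p in row] for row in image]


def threshold_by_subimage(image, size, levels):
    """Threshold each size x size tile with its own level, levels[(tile_row, tile_col)]."""
    out = [[0] * len(row) for row in image]
    for y, row in enumerate(image):
        for x, p in enumerate(row):
            out[y][x] = 0 if p <= levels[(y // size, x // size)] else 1
    return out


# Assumed sample: one bright object pixel in each half of a 2 x 4 image whose
# background is dark (10) on the left and much brighter (120) on the right.
image = [[10, 110, 120, 220],
         [10, 10, 120, 120]]

global_result = threshold(image, 115)   # misses the left object, floods the right half
local_result = threshold_by_subimage(image, 2, {(0, 0): 60, (0, 1): 170})
```

With the global level the left object is lost and the whole right half turns white, while the two local levels isolate one object pixel in each tile.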

Contour  tracing  algorithms  depend  upon  the  topological  properties  of 
adjacency  and  connectedness  for  contours.  Let  (x,y)  be  the  coordinates  of  a 
pixel. The four horizontal and vertical neighbors of (x,y), namely pixels at
(x-1,y), (x,y-1), (x,y+1), and (x+1,y), are the 4-neighbors of (x,y) and are
4-adjacent to (x,y) [HoK82]. Similarly, including the diagonally adjacent
neighbor pixels at (x-1,y-1), (x-1,y+1), (x+1,y-1), and (x+1,y+1) with the
4-neighbors of (x,y) yields the 8-neighbors of (x,y), which are 8-adjacent to
(x,y). The pixels of objects of interest are assumed to be 8-adjacent; this is
consistent with intuition about the connectedness of objects. Hence, contour
pixels, being the border pixels of an object, are 8-adjacent. Consequently, the
background is 4-adjacent. Consider the pattern of pixels in Figure 9.1. If the
pixels labeled with 1s represent object pixels and those with 0s represent
background pixels, then the two object pixels are connected and the two
background pixels are not.

Contour extraction can be preceded by processing such as radiometric
correction and rectification. Subsequent processing, once contours have been
extracted, may range from simple contour highlighting to shape analysis and
classification involving significant additional computation. The specific context
of  the  contour  extraction  scenario  would  depend  on  the  application.  Processing 
of  an  image  subsequent  to  contour  extraction  is  beyond  the  scope  of  this 
chapter. 
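
The adjacency definitions used in this section can be sketched in Python as follows. The function names are illustrative, and the 2 x 2 layout used for the Figure 9.1 pattern (1s on one diagonal, 0s on the other) is an assumption based on the figure title, since the figure itself is not reproduced here.

```python
def neighbors4(x, y):
    """The four horizontal and vertical neighbors of (x, y)."""
    return [(x - 1, y), (x, y - 1), (x, y + 1), (x + 1, y)]


def neighbors8(x, y):
    """The 4-neighbors plus the four diagonal neighbors of (x, y)."""
    return neighbors4(x, y) + [(x - 1, y - 1), (x - 1, y + 1),
                               (x + 1, y - 1), (x + 1, y + 1)]


def adjacent4(p, q):
    return q in neighbors4(*p)


def adjacent8(p, q):
    return q in neighbors8(*p)


# Assumed layout of the pattern:   1 0
#                                  0 1
ones = [(0, 0), (1, 1)]    # object pixels, tested with 8-adjacency
zeros = [(0, 1), (1, 0)]   # background pixels, tested with 4-adjacency

objects_connected = adjacent8(ones[0], ones[1])
background_connected = adjacent4(zeros[0], zeros[1])
```

Under these conventions the two object pixels are connected and the two background pixels are not, which is the situation described for Figure 9.1.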


9.3 Overlapped Subimage Method

A  set  of  serial  algorithms  that  produces  interpretation  results  from 
digitized  imagery  using  the  methods  outlined  in  the  preceding  section  has  been 
implemented at Purdue University on a VAX-11/780 computer. This method
analyzes an image in square subimages that are processed independently and
sequentially. Objects must be completely contained within some subimage if
they are to be found. To facilitate this, subimages are overlapped 50 percent
horizontally and vertically. The approach is referred to as the overlapped
subimage method for this reason. The overlap guarantees that objects
restricted to a maximum dimension of (r/2)-1 pixels (picture elements), for a
subimage size of r x r, will be contained in their entirety in some subimage. A
typical  value  for  r  is  256;  a  typical  value  for  image  size  is  5120  x  5120. 
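
The containment guarantee can be checked numerically along one axis for the typical values quoted above. This is a sketch under stated assumptions: subimage origins are taken to lie at multiples of r/2 starting from the image edge, and the helper names are hypothetical.

```python
def overlapped_origins(image_size, r):
    """Origins along one axis of r-wide subimages overlapped by 50 percent."""
    return list(range(0, image_size - r + 1, r // 2))


def is_contained(start, extent, origins, r):
    """True if the interval [start, start + extent) lies inside some subimage."""
    return any(o <= start and start + extent <= o + r for o in origins)


R, r = 5120, 256
origins = overlapped_origins(R, r)
per_axis = len(origins)          # subimages along one axis
total = per_axis ** 2            # subimages covering the whole image
max_extent = r // 2 - 1          # largest object dimension covered by the guarantee

# Every possible position of an object of the maximum extent falls inside a subimage.
all_covered = all(is_contained(s, max_extent, origins, r)
                  for s in range(R - max_extent + 1))
```

Under these assumptions there are 39 subimages per axis, so 1521 subimages are processed sequentially for one 5120 x 5120 image.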

Edge  information  has  been  used  to  guide  threshold  selection  to  achieve 
segmentation  that  more  accurately  separates  objects  of  interest  from  other 
parts of an image [Mil79, NeP82]. The first step of the overlapped subimage
method is to choose thresholds using edge-guided thresholding (EGT) [SuR82]
to segment the image. It selects threshold levels using an edge-matching
criterion instead of the classical technique using local minimum values in the
image gray level histogram [PrM66]. Initially, an edge image is generated using
the Sobel edge operator [DuH73]. A figure of merit, indicating how well a given
thresholded version of the input image matches edges in the edge image, is then
computed. Thresholds with high figures of merit are selected for requantizing
the input image prior to contour tracing. A median filter [GaW81] can be
applied immediately after thresholding to remove isolated noise artifacts.
Frequently, EGT gives better results than the histogram method because it is
able to detect small objects not discernible from the histogram [SuR82]. The
EGT  algorithm  will  be  detailed  in  the  next  section. 

Because  objects  are  contained  within  some  subimage  (by  definition)  when 
using  the  overlapped  approach,  a  simple  procedure  suffices  for  tracing  the 
contours  generated  by  segmentation.  First,  only  complete,  or  closed  contours 
need  be  traced.  Second,  if  a  subimage  is  scanned  left  to  right  and  from  top  to 
bottom  then  objects  will  be  first  encountered  at  the  leftmost  pixel  of  their 
uppermost  row  of  pixels.  Thus,  the  contour  tracing  algorithm  can  be 
formulated  to  take  advantage  of  the  fact  that  the  location  of  the  region 
bounded  by  the  contour  to  be  traced  is  known:  namely,  all  region  pixels  are 
either  to  the  right,  below,  or  both  right  and  below  the  first  found  pixel. 
Contours are stored as a sequence of x-y coordinates.

The  overlapped  subimage  approach  to  contour  extraction  was  described  in 
an  image  shape  analysis  method  [MRF81]  directed  toward  classifying  small 
well-defined objects such as buildings and airplanes. In [MRF81] extracted
contours are used for object classification by comparing the contours with
prototypic object models using either Fourier descriptors [WaW80] or standard
moments [Hu62, Tea80]. The overlapped subimage method yields good results
for this application, despite large brightness variations across an image, but is
computationally intensive. Execution time can be significantly reduced by
exploiting  the  parallelism  inherent  in  contour  extraction. 


9.4 Non-Overlapped Subimage Method

The  overlapped  subimage  method  as  it  stands  is  not  particularly  well 
suited  for  parallel  processing.  If  each  overlapped  subimage  is  assigned  to  a 
processor  then  either  much  image  data  must  be  passed  between  processors  or 
what  amounts  to  four  copies  of  the  image  must  be  stored.  This  is  because  the 
overlapping  causes  any  one  pixel  (except  for  a  few  near  the  corners  of  the 
image)  to  appear  in  four  subimages.  For  parallel  processing  a  non-overlapped 
approach  will  be  used.  With  this  approach  each  PE  will  be  assigned  one 
subimage,  and  subimages  will  not  overlap.  Note  that  assigning  more  than  one 
subimage  to  a  PE  using  the  non-overlapped  method  is  logically  equivalent  to 
assigning  only  one  subimage,  but  with  a  larger  size. 

An R x R pixel image is represented by an array of R^2 pixels, where the
value of each pixel is assumed to be an unsigned integer representing one of the
possible gray levels. To implement contour extraction on an SIMD/MIMD
machine of N PEs, assume that the PEs are logically configured as a √N x √N
grid, on which the R x R image is superimposed, i.e., each processor has an
(R/√N) x (R/√N) subimage (see Figure 9.2(a)). For example, if N = 1024 and
R = 5120, each PE stores a 160 x 160 subimage. Each pixel is uniquely addressed
by its i-x-y coordinates, where x and y are the x-y coordinates of the pixel in
the subimage contained in PEi. This data allocation minimizes the perimeter of
a subimage; the value of this is discussed in the next section.

Figure 9.2 (a) Data allocation for an R x R image using N PEs.
(b) Data transfers needed to apply the Sobel edge operator.

9.4.1 EGT Algorithm

The  parallel  EGT  algorithm  operates  on  each  subimage  independently  and 
is  performed  by  all  PEs  simultaneously.  It  consists  of  three  major  conceptual 
steps.  First,  an  edge  image  is  generated,  then  a  figure  of  merit  is  computed  for 
every  possible  threshold,  and  finally,  threshold  levels  are  selected  corresponding 
to  local  maxima  in  the  figure  of  merit. 

Let the image I be R x R and I(x,y) be a pixel, where 0 ≤ x,y ≤ R-1.
Let the subimage assigned to a processor be r x r, where r = R/√N. In the first
step, the Sobel operator is performed by each PE for its subimage, generating
the edge image. Data from adjacent subimages is acquired via the
interconnection network, all subimages receiving pixels simultaneously. The
number of parallel transfers needed for an R x R image is 4 * (R/√N + 1), as
shown in Figure 9.2(b). The Sobel operator (ignoring image border pixels for
clarity) is given in Figure 9.3. The value g(x,y) represents the gradient at
pixel (x,y); these values form the edge image. High edge-image pixel values
indicate the presence of an edge. The pixels immediately adjacent to the edges
of a subimage are needed in the Sobel calculation for those pixels at the edges
of a subimage. Thus, minimizing the perimeter of a subimage also minimizes
the number of pixels that must either be acquired from the adjacent subimages
or initially loaded into the memory of more than one PE. For this reason
square subimages are preferable to rectangular ones. Typically, the Sobel values
for pixels at the edge of an image are either not calculated or are calculated
assuming that pixel locations falling outside the image correspond to
background pixels. In this work pixel locations outside the image are assumed
to have the value of background pixels.
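
The data allocation and the transfer count quoted above can be sketched in Python. The row-major numbering of PEs across the grid and the helper names are assumptions made for illustration; the text only fixes the subimage size r = R/√N and the count 4 * (R/√N + 1). The value N = 1024 is likewise assumed, chosen because it yields the 160 x 160 subimage mentioned for R = 5120.

```python
import math


def locate(gx, gy, R, N):
    """Map image coordinates (gx, gy) to i-x-y coordinates (i, x, y)."""
    side = math.isqrt(N)                    # PEs per side of the logical grid
    r = R // side                           # subimage side length
    i = (gx // r) * side + (gy // r)        # assumed row-major PE numbering
    return i, gx % r, gy % r


def sobel_transfers(R, N):
    """Parallel transfers needed to give every PE a one-pixel border."""
    r = R // math.isqrt(N)
    return 4 * (r + 1)


subimage_side = 5120 // math.isqrt(1024)
transfers = sobel_transfers(5120, 1024)
example = locate(160, 159, 5120, 1024)
```

For these values each PE holds a 160 x 160 subimage and 644 parallel transfers supply the border pixels, compared with the 25,600 pixels each PE already stores.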


    for x = 0 to r-1 do begin
        for y = 0 to r-1 do begin
            sx(x,y) = (1/4) [ (I(x-1,y-1) + 2*I(x-1,y) + I(x-1,y+1))
                            - (I(x+1,y-1) + 2*I(x+1,y) + I(x+1,y+1)) ]
            sy(x,y) = (1/4) [ (I(x-1,y-1) + 2*I(x,y-1) + I(x+1,y-1))
                            - (I(x-1,y+1) + 2*I(x,y+1) + I(x+1,y+1)) ]
            g(x,y) = sqrt( sx(x,y)^2 + sy(x,y)^2 )
        end
    end

Figure 9.3 Sobel operator algorithm defined for a subimage.


The next conceptual step of the EGT algorithm is to determine the best
threshold values for each subimage. The figure of merit, M(T), is a measure of
how well the edges generated by a given threshold, T, match the detected
edges. It is computed by each PE for its subimage as follows.

1. For each input-image pixel the local maximum and minimum pixel
values in a 3 x 3 window centered on the pixel are determined.

2. For each possible threshold value (i.e., all gray levels) each input-image
pixel is tested to see if it is an edge point. It is an edge point for those
threshold levels greater than or equal to the local minimum and less
than the local maximum.

3. The figure of merit for a threshold is the mean of the edge-image pixels
corresponding to the input-image pixels found to be edge points for that
threshold.


The greater the mean of the edge points in Step 3, the better the match
between threshold-generated contours and the edges detected by the Sobel
operator. To avoid assigning a high figure of merit to a small number of noise
pixels, a small constant c can be added to the denominator when calculating
the mean, i.e.,

\[
M(T) = \frac{\sum_{(x,y) \in \varepsilon(T)} g(x,y)}{c + |\varepsilon(T)|}
\]

where ε(T) is the set of edge-image pixels corresponding to edge points for
threshold T and |ε(T)| is the number of pixels in that set. This has the effect
of decreasing M(T) if only a small number of pixels are above the threshold.
Threshold levels are selected corresponding to local maxima in the figure of
merit. The number of threshold levels selected depends on the image; three to
six thresholds are not uncommon.
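
The effect of the constant c and the selection of local maxima can be illustrated with a short Python sketch. The value c = 8, the sample gradient values, and the strict-inequality definition of a local maximum are assumptions chosen for illustration; the text does not specify them.

```python
def merit(edge_values, c):
    """Figure of merit for one threshold: gradient sum over (c + number of edge points)."""
    return sum(edge_values) / (c + len(edge_values))


def local_maxima(m):
    """Thresholds whose merit is strictly greater than that of both neighbors."""
    return [t for t in range(1, len(m) - 1) if m[t - 1] < m[t] > m[t + 1]]


noise = merit([30, 30], c=8)          # two strong but isolated noise pixels
real_edge = merit([20] * 200, c=8)    # many moderately strong edge pixels
selected = local_maxima([0, 2, 1, 5, 3])
```

Without c the two noise pixels would have a mean of 30 and outrank the genuine edge with mean 20; with c = 8 their merit drops to 6 while the genuine edge keeps a merit above 19.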

The  preceding  conceptual  steps  are  embodied  in  the  EGT  algorithm  shown 
in Figure 9.4. Let the subimage SI be r x r and SI(i,x,y) be a subimage pixel,
where 0 ≤ x,y < r, 0 ≤ i < N. The algorithm is executed on all subimages
(all  i)  simultaneously  and  combines  the  calculations  involved  in  the  three 
conceptual  steps  where  possible  to  reduce  total  computation. 

Referring to Figure 9.4, the first for statement clears the sumedge and
nedge counters. The next pair of nested for statements calculates quantities
associated with each pixel in the subimage. The first three statements compute
the Sobel operator, g(i,x,y). Next, the local maximum and minimum pixel
values over a 3 x 3 window are determined for each pixel. Note that the same
pixels necessary for the calculation of the gradient can be re-used in the local
maximum and minimum computation. Running sums of the edge-image pixels
(gradient values) corresponding to edge points at each threshold (sumedge) and
a count of the number of edge pixels for each threshold (nedge) are updated in
the innermost for loop. The mean for each threshold (sumedge divided by
c + nedge) is the figure of merit (merit) and is calculated in the final for
statement using the accumulated sums.
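
The computation performed by a single PE can be sketched in Python as follows. This is a serial illustration rather than the PASM implementation: pixels outside the subimage are replaced by an assumed background value instead of being fetched from neighboring PEs, and the function name and default parameters are hypothetical.

```python
import math


def egt_merit(SI, c=1.0, levels=256, background=0):
    """Return merit[thresh] for one r x r subimage SI of integer gray values."""
    r = len(SI)

    def at(x, y):
        # Stand-in for the border pixels that would come from adjacent PEs.
        return SI[x][y] if 0 <= x < r and 0 <= y < r else background

    sumedge = [0.0] * levels
    nedge = [0] * levels
    for x in range(r):
        for y in range(r):
            # w[dx + 1][dy + 1] holds the pixel at (x + dx, y + dy).
            w = [[at(x + dx, y + dy) for dy in (-1, 0, 1)] for dx in (-1, 0, 1)]
            sx = ((w[0][0] + 2 * w[0][1] + w[0][2])
                  - (w[2][0] + 2 * w[2][1] + w[2][2])) / 4
            sy = ((w[0][0] + 2 * w[1][0] + w[2][0])
                  - (w[0][2] + 2 * w[1][2] + w[2][2])) / 4
            g = math.hypot(sx, sy)
            window = [p for row in w for p in row]
            # The pixel is an edge point for localmin <= thresh < localmax.
            for thresh in range(min(window), max(window)):
                sumedge[thresh] += g
                nedge[thresh] += 1
    return [sumedge[t] / (c + nedge[t]) for t in range(levels)]


step = [[10, 10, 200, 200] for _ in range(4)]   # vertical step edge
m = egt_merit(step)
```

The 3 x 3 window is read once per pixel and reused for the gradient and for the local extrema, mirroring the reuse noted above.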

    for thresh = 0 to 255 do
        sumedge(i,thresh) = nedge(i,thresh) = 0

    for x = 0 to r-1 do begin
        for y = 0 to r-1 do begin
            sx(i,x,y) = (1/4) [ (SI(i,x-1,y-1) + 2*SI(i,x-1,y) + SI(i,x-1,y+1))
                              - (SI(i,x+1,y-1) + 2*SI(i,x+1,y) + SI(i,x+1,y+1)) ]
            sy(i,x,y) = (1/4) [ (SI(i,x-1,y-1) + 2*SI(i,x,y-1) + SI(i,x+1,y-1))
                              - (SI(i,x-1,y+1) + 2*SI(i,x,y+1) + SI(i,x+1,y+1)) ]
            g(i,x,y) = sqrt( sx(i,x,y)^2 + sy(i,x,y)^2 )
            localmax(i) = max [ SI(i,x-1,y-1), SI(i,x,y-1), SI(i,x+1,y-1),
                                SI(i,x-1,y),   SI(i,x,y),   SI(i,x+1,y),
                                SI(i,x-1,y+1), SI(i,x,y+1), SI(i,x+1,y+1) ]
            localmin(i) = min [ SI(i,x-1,y-1), SI(i,x,y-1), SI(i,x+1,y-1),
                                SI(i,x-1,y),   SI(i,x,y),   SI(i,x+1,y),
                                SI(i,x-1,y+1), SI(i,x,y+1), SI(i,x+1,y+1) ]
            for thresh = localmin(i) to localmax(i)-1 do begin
                sumedge(i,thresh) = sumedge(i,thresh) + g(i,x,y)
                nedge(i,thresh) = nedge(i,thresh) + 1
            end
        end
    end

    for thresh = 0 to 255 do
        merit(i,thresh) = sumedge(i,thresh)/(c + nedge(i,thresh))

Figure 9.4 Parallel algorithm for EGT at PEi.

For any threshold-based segmentation scheme, regions may intersect if
more than one threshold is chosen per subimage. Note that for any two such
regions, the pixels of one must be a subset of the other. For example, an
object may be segmented from the background by several threshold levels but
with somewhat different perimeters in each case. The action to take when
regions intersect can be left to the discretion of the end user of the processed
images: it is highly application dependent. Knowledge that regions intersect
can be valuable to subsequent object classification algorithms: a set of nested
regions might be considered a single object, and a clear classification of any one
deemed the classification of all.
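
The subset property of regions produced by different thresholds follows from the thresholding rule itself, as the following small Python sketch shows. The sample image and threshold values are assumptions, and the sketch compares whole above-threshold pixel sets rather than individual connected regions.

```python
def region(image, T):
    """Set of (x, y) positions whose gray value is above threshold T."""
    return {(x, y)
            for x, row in enumerate(image)
            for y, p in enumerate(row)
            if p > T}


image = [[10, 60, 10],
         [60, 90, 60],
         [10, 60, 10]]

low = region(image, 50)    # larger perimeter
high = region(image, 80)   # same object, tighter perimeter
nested = high <= low       # every pixel above 80 is also above 50
```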

9.4.2 Contour Tracing Algorithm

Contour  tracing  with  non-overlapped  subimages  is  more  difficult  than  for 
the  overlapped  subimage  scenario  for  two  reasons.  One  arises  out  of  the  fact 
that  there  can  be  no  assurance  that  a  given  object  will  reside  entirely  within 
some  subimage.  Coping  with  this  fact  does,  however,  allow  objects  of  arbitrary 
size  to  be  accommodated  with  no  additional  effort.  The  second  difficulty  arises 
out  of  the  fact  that  contour  tracing  is  to  be  done  by  multiple  processors  that 
must  coordinate  their  activity.  Once  a  one  pixel  wide  border  of  image  data 
around a given subimage is in place the EGT algorithm requires no further
inter-PE communication. Contour tracing can require access to the entire
contour by any PE; if that data is not to be duplicated on a truly massive
scale, PEs must communicate.

The algorithm is described in a top-down manner. Initial discussion is in
general  terms  and  presents  the  algorithm  strategy.  Further  discussion  evolves 
to  levels  of  greater  detail. 

The  non-overlapped  algorithm  for  contour  tracing  has  two  phases.  In 
Phase I, the subimage is segmented within each PE and all local contours
(both closed and partial) are traced and recorded. A partial contour is a
section of a contour with ends at the border of the subimage in which it is
located. The partial contours traced during Phase I are connected across
subimages to form closed contours in Phase II. Initially, each PE contains the
list of threshold values, T_1, T_2, ..., T_t, selected for its subimage, where t is
the number of thresholds. Also, each PE is assumed to have retained those pixels
bordering its subimage that were required by the EGT algorithm. This will
allow all contours, closed and partial, within a subimage to be detected, even
those that run along the edge of a subimage. A subimage and its border pixels
together form an augmented subimage.

During  Phase  I  each  PE  constructs  a  contour  table  containing  an  entry  for 
every  contour  located  in  its  subimage,  whether  partial  or  closed.  Each  entry 
has  the  following  fields: 

(a)  contour  identification  number, 

(b)  threshold  value  that  generated  the  contour, 

(c)  flag  indicating  whether  the  contour  is  closed  or  partial, 

(d)  pointer  to  the  pixel  i-x-y  coordinate  sequence  of  the  contour, 

(e)  number  of  pixels  in  the  contour, 

(f)  flag  indicating  whether  the  partial  contour  has  been  linked, 

(g) address of the PE that linked the contour,

(h)  PE  address  and  contour  identification  number  denoting  any  partial 
contour  blocking  extension  of  the  partial  contour  described  by  this 
table  entry,  and 

(i) partial contour locked/unlocked semaphore.

Each PE also builds a partial contour list. This list has an entry for each
partial contour containing the information necessary for connecting partial
contours in Phase II, recorded under the corresponding contour identification
number as used in the contour table (i.e., a pointer to its contour table entry).
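
One possible in-memory layout for the contour table and the partial contour list is sketched below in Python. The class name, field names, and types are hypothetical; only the meaning of fields (a) through (i) comes from the list above, and the semaphore is reduced to a plain flag for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]   # i-x-y coordinates of a pixel


@dataclass
class ContourEntry:
    contour_id: int                                     # (a)
    threshold: int                                      # (b)
    closed: bool = False                                # (c)
    pixels: List[Pixel] = field(default_factory=list)   # (d) coordinate sequence
    linked: bool = False                                # (f)
    linked_by: Optional[int] = None                     # (g) PE address
    blocked_by: Optional[Tuple[int, int]] = None        # (h) (PE address, contour id)
    locked: bool = False                                # (i) stand-in for the semaphore

    @property
    def length(self) -> int:                            # (e) number of pixels
        return len(self.pixels)


contour_table = {}
partial_contour_list = []      # holds contour ids, i.e. references into the table

entry = ContourEntry(contour_id=0, threshold=120)
entry.pixels += [(0, 3, 3), (0, 4, 3)]
contour_table[entry.contour_id] = entry
if not entry.closed:
    partial_contour_list.append(entry.contour_id)
```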

Phase I. In Phase I each PE processes its augmented subimage for each
of its threshold levels T_i, 1 ≤ i ≤ t. Each threshold is processed
independently. With a given threshold selected, the augmented subimage is
segmented. All contours, both closed and partial, are traced and added to the
partial contour list and/or contour table, as appropriate. Phase I continues
until all thresholds have been used in turn.

Contour  tracing  for  a  particular  threshold  begins  by  searching  for  untraced 
contours. This search is carried out by scanning rows of the segmented
augmented subimage in a pattern from left to right, beginning with the second
row (first row with subimage pixels) and continuing through the next to last
row (last row with subimage pixels). When the rows have been scanned, a final
scan  is  made  vertically  on  the  second  column  of  the  augmented  subimage 
(leftmost  column  with  subimage  pixels).  Scanning  is  suspended  whenever  an 
untraced  contour  is  encountered,  the  new  contour  is  traced,  its  pixels  are 
marked  as  traced,  and  then  scanning  is  resumed  in  order  to  detect  additional 
contours. 

The sequence of background pixels and unmarked object pixels
encountered by the scan indicates when an untraced contour has been found.
Because any contour pixel must have a neighboring background pixel (by
definition of a contour), and background pixels are 4-adjacent, every pixel on a
contour has a 4-adjacent background pixel. Thus, whenever the scan
encounters an unmarked object pixel within the subimage that has either a
preceding or succeeding background pixel, a new contour has been found. The
contour pixel at which scanning is suspended is the start point of that contour.
A contour is assigned an identification number (contour table field (a)) and the
threshold in use is recorded (field (b)) when a start point is found. The row
scans will detect all contours having at least one pixel with a horizontal
4-neighbor background pixel, even if that background pixel is part of the
augmented subimage but not the subimage. The row scans will miss those
contours having only vertical 4-neighbor background pixels, i.e., contours that
are horizontal lines extending entirely across the augmented subimage. The
column scan will detect these remaining contours, if there are any. In this way,
the augmented subimage scanning procedure will detect all contours, closed
and partial, within a subimage.
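
The start-point search can be sketched in Python as two generators over a segmented augmented subimage. Several details are assumptions: the array is indexed as seg[row][column] with 1 for object and 0 for background, the border occupies index 0 and index r+1, and the returned position numbers (4 for a preceding and 0 for a succeeding background pixel in a row scan, 2 and 6 for the column scan) follow the convention described for tracing. Tracing and marking between detections are omitted.

```python
def row_scan_starts(seg, marked):
    """Yield (row, col, p) for unmarked object pixels with a horizontal background neighbor."""
    r = len(seg) - 2
    for row in range(1, r + 1):
        for col in range(1, r + 1):
            if seg[row][col] == 1 and not marked[row][col]:
                if seg[row][col - 1] == 0:
                    yield row, col, 4      # preceding background pixel
                elif seg[row][col + 1] == 0:
                    yield row, col, 0      # succeeding background pixel


def column_scan_starts(seg, marked):
    """Final vertical scan of the leftmost subimage column."""
    r = len(seg) - 2
    for row in range(1, r + 1):
        if seg[row][1] == 1 and not marked[row][1]:
            if seg[row - 1][1] == 0:
                yield row, 1, 2            # background pixel above
            elif seg[row + 1][1] == 0:
                yield row, 1, 6            # background pixel below


def grid(n, value):
    return [[value] * n for _ in range(n)]


unmarked = grid(5, False)
dot = grid(5, 0)
dot[2][2] = 1                  # a single object pixel inside a 3 x 3 subimage
line = grid(5, 0)
line[2] = [1] * 5              # horizontal line across the augmented subimage

dot_rows = list(row_scan_starts(dot, unmarked))
line_rows = list(row_scan_starts(line, unmarked))
line_cols = list(column_scan_starts(line, unmarked))
```

The horizontal line is invisible to the row scans and is picked up only by the column scan, which is the case the final vertical scan exists to handle.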

Once  the  start  point  for  a  new  contour  has  been  found,  that  contour  is 
traced.  The  direction  in  which  a  contour  is  traced  has  either  a  clockwise  or 
counterclockwise  sense  with  respect  to  the  object.  If  a  contour  is  closed  within 
a  subimage  it  can  be  completely  traced  by  finding  successive  contour  pixels, 
proceeding  in  only  one  of  the  two  directions,  and  stopping  when  the  trace 
returns  to  the  start  point.  For  a  contour  contained  only  partially  within  the 
subimage, tracing in one direction (e.g., counterclockwise) will proceed until the
subimage boundary is reached. An end point is a location on a contour at
which the next object pixel along the contour in the direction of tracing is
outside the subimage (i.e., within the one pixel border). In order to complete
the trace of the contour within the subimage, it is necessary to return to the
start  point  and  trace  in  the  opposite  direction.  The  order  in  which  the 
directions  are  taken  is  immaterial:  all  contour  pixels  can  be  found  either  way. 
For  this  algorithm  an  arbitrary  choice  is  made  to  trace  counterclockwise  first. 

In order to define the conventions for tracing, consider the start point to
be the center of the 3 x 3 window shown schematically in Figure 9.5. During a
row scan, a preceding background pixel would be at position 4 (see Figure 9.5),
and a succeeding background pixel would be at position 0. For the column
scan the positions are 2 and 6, respectively. View the contour pixel and its
4-adjacent background pixel as defining a hand of a clock. If this hand is rotated
in the augmented subimage then the first object pixel it encounters is
guaranteed to be on the exterior of the object (i.e., 4-adjacent to the
background), and, hence, a contour pixel. If the hand is rotated
counterclockwise, the first object pixel is also the next pixel in a
counterclockwise trace of the contour.

Figure 9.5 (a) Naming convention for the neighbors of the center pixel in a
3 x 3 window. (b) Example showing start of tracing.

The counterclockwise tracing portion of Phase I of the tracing algorithm is
as follows. Let p be a position marker, and assign it the value of the position
number of the background pixel that is adjacent to the start point. Using
modulo 8 arithmetic, increment p by 1 (accomplishing a counterclockwise
inspection of the 8-adjacent pixels) and check the corresponding pixel until one
of the following conditions is met: the start point is reached, an end point is
detected, an unmarked object pixel is found, a marked object pixel is found, or
the initial value of p is reached. Conditions are checked in the order listed.
Figure 9.5(b) shows an example in which the background pixel was preceding
and in position 4. The value of p is thus 4, and incrementing it using modulo 8
arithmetic to the value 5 sweeps an imaginary hand counterclockwise from the
9 o'clock position. As shown in the example, an object pixel will be
encountered after incrementing p once more.
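
The counterclockwise inspection can be sketched in Python. The mapping from position numbers to neighbor offsets is an assumption: position 0 is taken as the right neighbor and numbers increase counterclockwise, which agrees with positions 4 and 0 being the preceding and succeeding pixels of a row scan. The sample image is also assumed, not the one in the figure.

```python
# (row, column) offsets for positions 0..7, counterclockwise from the right neighbor.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]


def first_object_ccw(seg, row, col, p):
    """Increment p modulo 8 until an object pixel is seen; return its position or None."""
    for step in range(1, 8):
        q = (p + step) % 8
        dr, dc = OFFSETS[q]
        if seg[row + dr][col + dc] == 1:
            return q
    return None     # all other neighbors are background


seg = [[0, 0, 0, 0],
       [0, 1, 1, 0],
       [0, 1, 1, 0],
       [0, 0, 0, 0]]

# Start point (1, 1) was found with a preceding background pixel, so p = 4.
found = first_object_ccw(seg, 1, 1, 4)
```

Here position 5 is background and the object pixel is met at position 6, i.e. after two increments of p, as in the example described above.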

If the start point is reached the contour is closed and has been completely
traced; the necessary information is stored in contour table fields (c), (d), and
(e) and scanning is resumed. Detecting an end point implies the contour is
partial. Note that an end point is a just previously traced object contour pixel.
The contour identification number and location of the end point are recorded
in the partial contour table for use in Phase II, contour table field (c) is set,
counterclockwise tracing is terminated, and clockwise tracing is begun from the
start point. If an unmarked object pixel is found its i-x-y coordinate is
appended to the contour sequence, it is marked as traced using a bit map, a
new value of p is computed, and counterclockwise tracing continues from this
new location. The new value of p is p + 5 modulo 8 and is the location of the
background pixel adjacent the previous contour pixel. Such contours will
indicate some, perhaps many, intersecting regions. Finding a marked object
pixel implies there is no interior to the object bounded by the contour, at least
in the immediate area. Desirable action in this case may range from ignoring
such contours to eliminating one pixel wide regions to treating such contours as
any other; the end user will choose. If the initial value of p is reached, then all
eight neighbor pixels have been examined without finding one in any of the
above categories, so the object point is a single isolated point and will be
ignored. No entry is made in the contour table, the pixel is marked, and
scanning is resumed.
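
A reduced version of the counterclockwise trace is sketched below in Python. It is a simplification under stated assumptions: end point detection is omitted, so it applies only to objects lying entirely inside the array with a background border; the position-to-offset mapping (0 at the right neighbor, increasing counterclockwise) is assumed; and the returned status strings are hypothetical names for the outcomes described above.

```python
# (row, column) offsets for positions 0..7, counterclockwise from the right neighbor.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]


def trace_ccw(seg, start, p):
    """Trace from `start`, where p is the position of its adjacent background pixel."""
    contour, marked, cur = [start], {start}, start
    while True:
        initial = p
        while True:
            p = (p + 1) % 8
            dr, dc = OFFSETS[p]
            nxt = (cur[0] + dr, cur[1] + dc)
            if nxt == start:                      # start point reached
                return contour, "closed"
            if seg[nxt[0]][nxt[1]] == 1:
                if nxt in marked:                 # marked object pixel found
                    return contour, "marked"
                contour.append(nxt)               # unmarked object pixel found
                marked.add(nxt)
                cur = nxt
                p = (p + 5) % 8                   # background next to previous pixel
                break
            if p == initial:                      # all eight neighbors examined
                return contour, "isolated"


block = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
closed_result = trace_ccw(block, (1, 1), 4)
```

For the 2 x 2 block the trace visits the pixel below the start point, then the pixels to the right and above, and stops when the start point is met again.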

There are some special cases that can arise during the course of tracing. If
intersecting regions are of interest to the end application, the bit map can be
generalized to one bit map per threshold level, which will make it easy to
detect intersecting contours. If isolated points should not be ignored for a
particular application, they can be recorded as contours before scanning is
resumed.

The procedure for clockwise tracing is analogous to that for
counterclockwise tracing. The clockwise trace is begun by returning to the
start point and restoring p to its value when counterclockwise tracing began.
Decrement p by 1 using modulo 8 arithmetic and check the corresponding pixel
until an end point is detected or an object pixel is found. If the second end
point is detected, the location of that end point is recorded in the partial
contour table, clockwise tracing is terminated, and scanning is resumed. If an
object pixel is found its i-x-y coordinate is inserted at the front of the contour
pixel list, it is marked as traced, a new value of p is computed as
p - 5 modulo 8, and clockwise tracing continues from this new location. The
position p - 5 modulo 8 is the location of the background pixel adjacent the
previous contour pixel for a clockwise trace. There are instances where all
pixels of a partial contour will be found during the counterclockwise trace; the
clockwise tracing algorithm will correctly confirm this fact if true.

Consider the following contour tracing example based on Figure 9.6. A
10 x 20 pixel image is divided into two 10 x 10 subimages; the subimages are
loaded into adjacent PEs; and the local threshold value T_1 is applied by each
PE. Dots in the figure indicate pixels above the threshold. The one-pixel-wide
border around each subimage is not shown.

Each PEi, 0 ≤ i ≤ 1, begins scanning its subimage at pixel (i,0,-1), the
leftmost pixel in row 0, the first row of the augmented subimage with subimage
pixels. PE0 locates a start point at pixel (0,3,3) and traces the contour
counterclockwise to an end point at pixel (0,7,9). Next, PE0 traces the contour
clockwise from pixel (0,3,3), reaching the other end point at pixel (0,3,9). After
the clockwise trace, the first pixel in the i-x-y coordinate sequence describing
the contour is (0,3,9). PE0 resumes its scan pattern at pixel (0,3,4) and finds
no other contours in its subimage. Note that, for example, pixel (0,4,4) is not a
start point for a new contour at this threshold because it was marked during
the trace of the first contour. Meanwhile, PE1 locates a partial contour with
(1,7,0) as the first pixel in its i-x-y sequence.

As a further example, a 30 x 20 image is divided into six 10 x 10 subimages;
each subimage is loaded into one of six PEs. Figures 9.7 and 9.8 show the
results of the row scans and the column scan, respectively.

Figure 9.6 Example of Phase I contour tracing for a 10 x 20 image. The
triple (i,x,y) represents the i-x-y coordinates of a pixel. The legend
distinguishes object pixels, start points, clockwise and counterclockwise
trace marks, and the end points reached by the clockwise and
counterclockwise traces.

Figure 9.7 Contours found by the Phase I row scans for a 30 x 20 sample
image.

Figure 9.8 Contours found by the Phase I column scan for the image of
Figure 9.7.

Phase II. In Phase II, partial contours are connected to form closed
contours. Two possible conditions to determine when a PE may enter Phase II
are: (1) no PE starts Phase II processing until all have completed Phase I,
and (2) each PE enters Phase II immediately after completing Phase I. Under
the latter condition a PE must be restricted to extending partial contours
only into subimages of PEs that are also in Phase II. If all neighboring PEs
are still in Phase I, the PE must wait. The first alternative requires time
equal to the sum of the times for each phase, since the phases are carried
out sequentially. The second approach may be expected to reduce execution
time, because the PE with the longest Phase I time need not be the one with
the longest Phase II time. However, because the work a PE does is data
dependent and can also depend on whether all PEs start Phase II together or
not, the second approach may increase execution time in some cases. It is
not clear which approach is superior in general.
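
The trade-off between the two entry conditions can be illustrated with a
small calculation. The following Python sketch is not part of the original
algorithm description; the per-PE phase times are hypothetical, and the
second function gives only a lower bound, because it ignores time a PE
spends waiting for neighbors still in Phase I and any change in Phase II
work caused by the staggered start.

```python
def barrier_time(phase1, phase2):
    """Condition (1): all PEs finish Phase I before any PE starts Phase II."""
    return max(phase1) + max(phase2)


def staggered_lower_bound(phase1, phase2):
    """Condition (2): each PE starts Phase II when its own Phase I ends.

    This is a lower bound only: waiting on neighboring PEs is not modeled.
    """
    return max(t1 + t2 for t1, t2 in zip(phase1, phase2))


# Hypothetical per-PE execution times (arbitrary units) for three PEs.
phase1_times = [5, 9, 4]
phase2_times = [8, 2, 3]
print(barrier_time(phase1_times, phase2_times))
print(staggered_lower_bound(phase1_times, phase2_times))
```

With these figures the barrier version takes 17 units, while the staggered
version cannot finish in fewer than 13; how close it comes to that bound
depends on the image data.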

In Phase II each PE attempts to connect its partial contours to partial
contours located in neighboring subimages, extending each partial contour so
as to eventually form a closed contour. The goal is for every closed contour
to be formed exactly once. Exactly one PE should hold the i-x-y sequence for
each contour at the end of Phase II. The question of where to look for
extending partial contours must be addressed.

If the end point of a given partial contour is not at a corner of its
subimage, then there are three pixels that possibly can extend it, located in
the adjacent subimage. That is, there are three locations in the adjacent
subimage that are 8-neighbors of the given end point. The PE accesses the
partial contour list for the adjacent subimage and considers the possible
extending pixels one at a time in counterclockwise order to determine if any
partial contours in the adjacent subimage have the possible extending pixel
as an end point. Counterclockwise order is simply the convention chosen. If
there is more than one partial contour with the same end point that can
extend the given contour, the partial contour that was generated by a
threshold value closest to that for the given contour is selected.

To correctly link contours and prevent duplication of linking effort, partial
contours can be locked to provide for exclusive access. If an extending
partial contour exists, the PE checks the contour table entry pointed to by
the partial contour list. If the entry is unlocked and unlinked, the i-x-y
sequence for the partial contour is transferred and concatenated to the given
partial contour, forming the i-x-y sequence of the extended partial contour.
The PE also sets the flag indicating that the adjacent partial contour was
linked and leaves its address (contour table fields (f) and (g) of the remote
PE). Contour table field (f) notifies the PE containing that partial contour
not to expend effort extending that particular partial contour.

If the end point of a given partial contour is a corner point of its
subimage, there are five pixels located in adjacent subimages that can
possibly extend the contour. Since these five pixels are located in three
different subimages, the PE attempting to extend the given partial contour
must check for continuation in each of the adjacent subimages in
counterclockwise order; pixels within a subimage are also checked in
counterclockwise order.
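
The three-pixel and five-pixel cases can be checked with a short Python
sketch. It is illustrative only and rests on several assumptions not stated
in the text: PEs are numbered in row-major order over the grid of subimages,
x is the row and y the column within a subimage, and the counterclockwise
enumeration starts from the east neighbor. The function name and parameters
are hypothetical.

```python
# (row step, column step) for the 8-neighbors, listed counterclockwise
# starting from east. The starting direction is an assumption.
CCW_OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
               (0, -1), (1, -1), (1, 0), (1, 1)]


def external_neighbors(pe_row, pe_col, x, y, size, grid_rows, grid_cols):
    """Return (pe, x, y) for each 8-neighbor of (x, y) in another subimage.

    (pe_row, pe_col) locates the PE in a grid_rows x grid_cols grid of
    size x size subimages; PE numbers are assumed row-major.
    """
    found = []
    for dx, dy in CCW_OFFSETS:
        nx, ny = x + dx, y + dy
        # Floor division yields -1, 0 or +1: the step to the owning PE.
        step_r, step_c = nx // size, ny // size
        if step_r == 0 and step_c == 0:
            continue  # neighbor is still inside this PE's subimage
        r, c = pe_row + step_r, pe_col + step_c
        if not (0 <= r < grid_rows and 0 <= c < grid_cols):
            continue  # neighbor lies outside the image
        found.append((r * grid_cols + c, nx % size, ny % size))
    return found


# End point (0,1,5) of a 6 x 6 subimage in a 2 x 2 grid of PEs (edge case).
print(external_neighbors(0, 0, 1, 5, 6, 2, 2))
# Corner pixel (0,5,5) of the same subimage.
print(external_neighbors(0, 0, 5, 5, 6, 2, 2))
```

For a 12 x 12 image divided among four PEs, the edge pixel yields three
candidate locations, all in PE1, and the corner pixel yields five candidates
spread over PE1, PE2, and PE3.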

Note that regardless of where end points lie, the search for pixels to
extend the partial contour can be widened to allow for segmentation-induced
contour discontinuities at subimage boundaries. Thresholds could be
interpolated across subimage boundaries to allow partial contours with
non-adjacent end points to be joined. Assume that PEi has a partial contour
to extend. If no suitable partial contour is found in the partial contour
list of the adjacent subimage, PEi probes into the adjacent subimage to
determine if an extension of the partial contour can be generated by the
threshold, T, used by PEi to trace its partial contour. If so, PEi extends
its partial contour by accessing the data from the adjacent PE. Instead of
creating an entire segmented subimage for the threshold T, PEi dynamically
thresholds pixels as needed. This contour generation using T is done because
the partial contour in the adjacent PE may not have been located in Phase I
owing to the different threshold values used.

A lone object pixel located adjacent to the border of a subimage will be
listed as a partial contour during Phase I. During Phase II such a pixel will
either be linked to a partial contour from an adjacent PE or identified as an
isolated point. Isolated points can be detected by their length as partial
contours and the lack of extending contours. Once detected they can be
treated as other isolated points.

Once PEi locates a partial contour in an adjacent subimage that continues
the given partial contour and has stored the concatenated contour, it repeats
the process, as necessary, by following the contour to the next PE until the
contour is closed or cannot be extended. A limit is placed on contour length
to guarantee algorithm termination in the event of a pathological image;
contour length can be determined from contour table field (c).

Because many processors are each working independently, yet their actions
must be coordinated to reach the goal of extracting contours, knowledge of
where partial contours can be extended is insufficient. There must be a
protocol to guide the activity, and the necessary support mechanisms must be
in place. By way of illustration consider the example in Figure 9.9, where a
12 x 12 pixel image is divided among four PEs. After Phase I, PE0 has found
partial contours A with end points (0,1,5) and (0,5,1) and B with end points
(0,4,5) and (0,5,3); PE1 has found partial contour C with end points (1,4,0)
and (1,1,0); and PE2 has found partial contour D with end points (2,0,1) and
(2,0,3).

Assume PE0 attempts to extend A (A must not be marked as linked in
field (f)) from pixel (0,1,5). It may well be that A is the first entry in
the partial contour table of PE0, and pixel (0,1,5) is the first location in
its i-x-y sequence; partial contours may be considered for extension in any
order and extended from either end, however. If PE1 attempts to extend C at
this time, the danger that a given contour (ACBD in this instance) can be
formed more than once by multiple PEs is clear. Thus, a mechanism to prevent
one PE from using a contour table entry while another PE is in the process of
using that entry must be available.

A semaphore is a variable, the value of which indicates whether or not a
critical section can be entered [Dij68], and it can be used to enforce
exclusive access to data. There is a semaphore for each contour table entry,
which can take on a value of zero or one (field (i) in the contour table).
Before a PE enters a critical section for a given contour, it performs a
P-operation [Dij68] to determine if the semaphore for that contour is one,
indicating it is unlocked. If so, the processor sets the semaphore to zero,
locking the contour table entry so that no other processor can access it,
and enters the critical section to link the new partial contour. When the PE
reaches the end of the critical section, it performs a V-operation [Dij68],
resetting the semaphore to one and unlocking the contour table entry. If the
P-operation had determined that the semaphore was zero, the PE receives a
message indicating that the partial contour is locked.
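
The behavior of the per-entry semaphore can be sketched in Python as
follows. This is a minimal illustration, not the implementation used on the
parallel machine: the class name and methods are hypothetical, only contour
table field (i) is modeled, and a threading.Lock stands in for whatever
hardware or network mechanism makes the test-and-set step indivisible. Note
that the P-operation here returns immediately rather than waiting, matching
the "locked" message described above.

```python
import threading


class ContourEntry:
    """Hypothetical contour table entry; only semaphore field (i) is modeled."""

    def __init__(self):
        self.semaphore = 1               # 1 = unlocked, 0 = locked
        self._guard = threading.Lock()   # makes test-and-set atomic here

    def p_operation(self):
        """Try to lock the entry; return False if it is already locked."""
        with self._guard:
            if self.semaphore == 1:
                self.semaphore = 0
                return True
            return False

    def v_operation(self):
        """Unlock the entry at the end of the critical section."""
        with self._guard:
            self.semaphore = 1


entry = ContourEntry()
if entry.p_operation():
    try:
        pass  # critical section: link the partial contour here
    finally:
        entry.v_operation()
```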

Returning to the example of Figure 9.9, PEs now first lock a partial
contour they will try to extend. Thus, when PE0 attempts to extend A it will
receive a message that C is locked. If PE0 waits for C to become available,
then eventually PE0, PE1, and perhaps PE2 will all be waiting for access to
partial contours that another member of the group has locked.

The situation in which each of two or more processes is halted while waiting
for the other(s) to continue is known as deadlock [Sto80]. In MIMD mode each
PE executes a distinct process, so PEs that wait on each other are
deadlocked. When a PE is blocked by a locked partial contour, deadlock is
prevented if (1) no PE may wait for access to a locked contour table entry of
another PE, and (2) a blocked PE unlocks the partial contour it was
attempting to extend. This deadlock avoidance scheme is one aspect of the
contour tracing protocol needed for Phase II.

When a PE successfully extends a partial contour it performs the
appropriate modification on the (f) and (g) fields of the contour table
containing the extending partial contour. Once again considering Figure 9.9,
assume PE1 has formed CB and PE2 has formed DA, both have modified contour
table entries as needed, and both are ready to continue to extend their
partial contours C and D, respectively. If both try to extend simultaneously,
each finds the other's partial contour locked; since neither may wait, both
will abandon their partial contours, and the closed contour will not be
found. The remaining aspect of the contour tracing protocol for Phase II
addresses this problem.

To ensure that the linking of a contour will not be abandoned by all PEs,
the following protocol is used. A total ordering is established on the PEs.
Given a binary relation R defined on the elements of a set S, if either aRb
or bRa holds for every pair of distinct elements a and b in S, then R totally
orders the set S. For example, < totally orders the set of all integers. One
convenient total ordering is the < relation on PE addresses; clearly, others
are possible. Assume PEi is blocked from extending a partial contour X
because PEj, which has higher priority (i.e., i < j), has locked the partial
contour Y that PEi needs. In that case, PEi unlocks X and sends a message
informing PEj that PEi has abandoned its attempt to extend X. The message
sent from PEi to PEj contains the identification numbers of X and Y and the
value i. After receiving the message, PEj searches its contour table to
determine if it abandoned Y. To do this it uses field (h) of the contour
table. If it had, it now begins extending Y by adding X from PEi (unless X
has been locked in the interim by yet another PE tracing a portion of the
contour comprised in part by X and Y) and continues until it has closed the
contour or is blocked.
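
The abandon-and-notify step can be sketched as follows. This Python fragment
is a simplified model under stated assumptions, not the original protocol
code: the PE class, its abandoned set (standing in for contour table
field (h)), the inbox list (standing in for messages on the interconnection
network), and both function names are hypothetical. Locking itself is left
out, and a higher PE address is taken to mean higher priority.

```python
from dataclasses import dataclass, field


@dataclass
class PE:
    address: int
    abandoned: set = field(default_factory=set)  # stands in for field (h)
    inbox: list = field(default_factory=list)    # stands in for the network


def on_blocked(me, other, x_id, y_id):
    """me was extending X and found Y locked by other."""
    me.abandoned.add(x_id)           # unlock X; never wait on a locked entry
    if me.address < other.address:   # other has higher priority
        other.inbox.append((x_id, y_id, me.address))


def process_inbox(pe):
    """Return (Y, X, sender) triples for which pe should resume extending Y."""
    resume = []
    for x_id, y_id, sender in pe.inbox:
        if y_id in pe.abandoned:     # pe gave up on Y earlier
            pe.abandoned.discard(y_id)
            resume.append((y_id, x_id, sender))
    pe.inbox.clear()
    return resume


# PE1 holds CB and PE2 holds DA; each is blocked by the other's lock.
pe1, pe2 = PE(1), PE(2)
on_blocked(pe1, pe2, "CB", "DA")
on_blocked(pe2, pe1, "DA", "CB")
print(process_inbox(pe2))
print(process_inbox(pe1))
```

In this model only the lower-priority PE sends a message, so exactly one PE
(here PE2) resumes and the contour is not abandoned by both.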

The deadlock avoidance and PE priority protocols allow PEs to attempt to
extend their partial contours in any order and from either end point with
correct results assured. This is advantageous if a long contour extending
across many subimages is to be formed during Phase II from its constituent
partial contours. Many PEs can all work simultaneously to extend partial
contours along this contour, each dropping out to work on any other partial
contours that might be in its subimage as it is blocked by a higher priority
PE. In effect, contour tracing is performed using the divide-and-conquer
model at both the image level (subimage per PE) and the contour level
(possible simultaneous, multiple-PE partial contour linking for a single
contour).

When Phase II of the algorithm is complete, the i-x-y sequence for each
contour in the image will be stored in the memory of exactly one of the PEs
that contained part of the contour originally. For the example of Figure 9.9,
PE2 will contain the i-x-y sequence for contour DABC in the event that PE1
and PE2 tried to simultaneously extend CB and DA, respectively, since PE2 has
higher precedence under physical address ordering. Because each PE tries to
connect its contours independently, there is no simple rule to determine
which PE will finally close a given contour. Although this may be undesirable
in some instances, a side effect of the procedure is to tend to equalize the
processing load of the PEs and the number of closed contours found by each
PE.

9.5  Example  of  Non-Overlapped  Method  Performance

The  parallel  algorithms  described  in  the  previous  two  sections  have  many 
potential  applications,  one  of  which  is  the  visual  inspection  of  thick  film 
products  within  an  industrial  environment.  In  such  an  environment  scene 
illumination  can  be  closely  controlled,  so  subimage-based  EGT  may  not  be 
needed  for  its  adaptive  capability,  but  its  accuracy  may  be  desirable.  However, 
using  subimage-based  EGT  incurs  no  penalty.  A  functional  simulation  of  the 
algorithms  of  the  previous  sections  has  been  coded  and  executed;  the  results  are 
presented  in  this  section  [MiD83].

The section of a printed circuit board in Figure 9.10 is shown with a
16 x 16 pixel subimage grid, the subimage size used in the simulation. The
board was purposefully illuminated unevenly to illustrate the capability of
subimage-based EGT. When a single best global threshold is used for the
image, the result is the binary image shown in Figure 9.11. Clearly, a single
threshold level will not yield satisfactory segmentation of the circuit wires
and substrate. The EGT algorithm was used to determine a suitable threshold
for each subimage; since ideally this image would have pixels of only two
values, a single threshold in each subimage suffices. In addition to using an
additive constant to suppress noise when computing threshold figure of merit
values, the resulting merit value functions M(T) were smoothed over five
points to yield the desired single maximum. The resulting merit value graphs
for a representative group of the subimages are shown in Figure 9.12.

The  EGT  algorithm  becomes  more  adaptive  as  the  subimage  size 
decreases,  but  choosing  too  small  a  subimage  can  create  problems.  For 
instance,  there  may  no  longer  be  an  edge  within  a  subimage,  causing  the 
algorithm  to  select  a  threshold  related  to  the  noise  or  DC  change  within  the
subimage.  There  are  some  subimages  in  the  example  image  that  do  not 
contain  an  edge.  In  these  subimages  small  gray  level  variations  cause  the  EGT 
algorithm  to  detect  false  merit  value  peaks.  It  was  necessary  to  apply  a 
threshold  to  the  merit  values  themselves  before  an  edge  was  considered 
detected  for  a  subimage. 

The  threshold  used  to  insure  that  a  chosen  merit  value  peak  actually 
corresponds  to  a  significant  edge  should  be  related  to  the  distribution  of  the 
Sobel  values  for  the  entire  picture.  In  the  case  of  the  thick  film  board,  both 
the  Sobel-generated  edge  image  and  input  image  histograms  should  be  bimodal, 
but  due  to  the  poor  lighting,  the  modes  are  smaller  in  magnitude  and  wider, 
and  may  even  be  merged.  A  percentile  of  the  Sobel  values  over  the  entire 
image  is  a  very  robust  merit  value  threshold.  Empirically,  the  50-th  percentile 
of  the  Sobel  value  histogram  is  an  appropriate  merit  value  threshold  for  this 
example.  Any  value  between  the  30-th  and  70-th  percentile  gives  equivalent
results  (see  Figure  9.12). 
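
A percentile-based merit value threshold of this kind might be computed as
in the following Python sketch. The nearest-rank percentile definition and
the function names are assumptions; the text specifies only that a
percentile of the Sobel values over the entire image is used and that a peak
must exceed it to be accepted.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile of values, for 0 < p <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def valid_peaks(peak_merits, sobel_values, p=50):
    """Flag each subimage's peak merit value as a significant edge or not.

    peak_merits: maximum merit value found in each subimage.
    sobel_values: Sobel magnitudes for every pixel of the whole image.
    """
    cutoff = percentile(sobel_values, p)
    return [merit > cutoff for merit in peak_merits]


# Hypothetical data: four Sobel magnitudes and two subimage peaks.
print(valid_peaks([1.0, 5.0], [1, 2, 3, 4]))
```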

Figure 9.12 The EGT merit value graphs for the 64 subimages in the upper
left corner of Figure 9.10. The horizontal axis for each graph is gray value
(0 to 255) and the vertical axis is the threshold merit value (0 to 32).

Consider the problem of assigning gray levels to the degenerate subimages
that do not have merit values above the merit threshold. Neighboring
subimages at increasing distance are searched for the nearest with a valid
threshold. Once found, this valid threshold is used to threshold the
degenerate subimage. Table 9.1 shows the figures of merit, corresponding
thresholds, and the thresholding decision made for subimages in the upper
left corner of the image. The first number in each box is the maximum figure
of merit value, the second is the corresponding gray level threshold value,
and the third is an asterisk if the corresponding gray value is a valid
threshold, i.e., its figure of merit value is above the Sobel 50-th
percentile. A "W" or "B" as the third entry indicates that a valid threshold
was not found for that subimage, and the default value was chosen as white or
black, respectively. No completely black subimages were found in this portion
of the image; thus there are no "B" entries in this table. The segmented
image using the results of EGT is shown in Figure 9.13. Compare this with the
image segmented using the best single threshold, shown in Figure 9.11.
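
The search for the nearest valid threshold could be organized as in the
sketch below. It is a serial Python illustration under assumptions the text
leaves open: distance is measured in whole subimages using the larger of the
row and column offsets, ties at equal distance are broken by scan order, and
an invalid threshold is represented by None. The function name is
hypothetical.

```python
def nearest_valid_threshold(thresholds, row, col):
    """Find the closest valid threshold to subimage (row, col).

    thresholds is a 2-D list with one entry per subimage; None marks a
    degenerate subimage. Rings of increasing distance are searched, and
    None is returned if no subimage has a valid threshold.
    """
    rows, cols = len(thresholds), len(thresholds[0])
    for dist in range(1, max(rows, cols)):
        for r in range(row - dist, row + dist + 1):
            for c in range(col - dist, col + dist + 1):
                if max(abs(r - row), abs(c - col)) != dist:
                    continue  # not on the ring at exactly this distance
                if 0 <= r < rows and 0 <= c < cols:
                    if thresholds[r][c] is not None:
                        return thresholds[r][c]
    return None


# Hypothetical 3 x 3 grid of subimage thresholds.
grid = [[None, None, 120],
        [None, None, None],
        [90, None, None]]
print(nearest_valid_threshold(grid, 1, 1))
```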

The  actual  partial  contour  tracing  and  linking  is  straightforward  because 
segmentation  is  binary  in  the  example.  The  black  wire  traces  are  considered  to 
be  8-neighbor  objects,  and  the  white  substrate  a  4-neighbor  background. 
Contour  tracing  was  as  described  in  the  previous  section.  Every  partial 
contour  in  the  example  was  found  to  have  an  end  point  that  was  an  8-neighbor 
to  an  end  point  of  a  partial  contour  in  the  adjoining  subimage.  The  found 
contours  are  shown  in  Figure  9.14.  Except  for  inadequate  resolution  in  the 
input image near the left edge, all contours were correctly extracted.

Figure 9.13 Binary image resulting from EGT-based segmentation of
Figure 9.10.

9.6  Analysis  of  Non-Overlapped  Method

Choice of Parallel Architecture. Details of the computation in each
PE for the EGT algorithm (Figure 9.4) are quite similar. This is particularly
true during the Sobel edge image generation and determining the local minima
and maxima. Thus, a single instruction stream will be relatively efficient.
SIMD mode operation facilitates the PE-to-PE communication necessary when
subimage border pixels within each PE must be processed. Transmission
delays incurred due to PE-to-PE data transfers can be overlapped with data
processing to reduce total execution time.

PE-to-PE communications in MIMD mode require explicit synchronization
between the two processors transferring data. Thus, SIMD mode transfers
should be used when possible for large transfers of data to efficiently
provide each PE with the one-pixel-wide border points of its augmented
subimage. However, once each PE has all of the data it needs to perform the
EGT algorithm, the calculations could proceed in MIMD mode.

There are two aspects of the EGT algorithm less well matched to SIMD
parallelism. One is the square root calculation in the Sobel operator. Square
root is usually performed in software with an iterative algorithm whose
execution time is data dependent, so processor time is lost in
resynchronizing after the operation. The other is the fact that each PE
performs the innermost for loop in Figure 9.4 using a different localmin and
localmax and, hence, performs the statements in the loop (updates the sums) a
different number of times. PEs are disabled when they finish their looping,
since PEs must remain synchronized in SIMD mode. The total time to perform
the innermost for loop is the maximum time taken by any PE.

Although MIMD mode would make the execution of the innermost for
loop (Figure 9.4) more efficient (because no PEs would be disabled), this
advantage must be weighed against the extra time involved in switching from
SIMD to MIMD mode and requiring that each PE perform its own control flow
operations for the outer two for loops. Control flow operations include
initialization and incrementing of loop counters, evaluation of conditional
expressions, and branching. These operations are performed by a system
control unit in SIMD mode for the outer loops and can be overlapped with the
PE operations. Overall, SIMD mode execution seems more advantageous for
the EGT algorithm.
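
The cost of disabling PEs in the innermost loop can be estimated as the
fraction of PE time spent on useful work. The Python sketch below is
illustrative only: the iteration counts are hypothetical, every iteration is
assumed to take the same time, and the function name is not from the source.

```python
def simd_loop_utilization(iterations_per_pe):
    """Fraction of PE time doing useful work in a synchronized SIMD loop.

    All PEs are held for as many iterations as the slowest PE needs;
    a PE that finishes early is disabled for the remainder.
    """
    longest = max(iterations_per_pe)
    useful = sum(iterations_per_pe)
    return useful / (longest * len(iterations_per_pe))


# Hypothetical (localmax - localmin) iteration counts for four PEs.
print(simd_loop_utilization([10, 20, 40, 10]))
```

For these counts half of the available PE time is spent disabled; whether
that loss outweighs the cost of switching to MIMD mode depends on the
machine.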

The number of PEs to use is an important consideration for the
implementation of contour extraction on a parallel computer. Consider first
using few PEs. As the number of PEs (N) is decreased, subimage size increases
for a fixed total image size. For large subimages, the ratio of subimage
border pixels to total pixels is low. This increases processing efficiency
because inter-PE transfers of border pixel values take up only a small
fraction of the total time. The speedup factor is the ratio of execution time
on a serial computer of power equivalent to one PE to execution time with a
parallel computer of N PEs. It approaches N for arithmetic operations for
large subimages. A speedup factor of N is considered optimal because it
indicates that N PEs are accomplishing N times the work of a single PE in a
given time. However, if image average brightness is markedly uneven, subimage
size cannot be arbitrarily large without compromising segmentation accuracy.
Also, very large subimage size may potentially degrade performance due to the
finite size of physical main memory in any computer; for better performance,
subimage size should be restricted so as to avoid paging while processing a
subimage.

Consider now a large number of PEs applied to the contour extraction
problem. As N is increased, the subimage size decreases. In this case, the
ratio of border pixels to total subimage pixels is high and inter-PE
transfers account for a large percentage of total processing time. Total
processing time is minimized as N increases; however, the speedup factor
decreases (much time is spent on transfer operations between PEs, which are
not needed for the serial machine). Too large a value of N will result in too
small a subimage size for good accuracy in the EGT step. Very small subimages
may contain no true contours yet report false ones because of image noise.
Also, use of a large number of PEs may be inefficient for contour tracing if
it reduces the number of complete contours found in Phase I, requiring
heavier use of inter-PE communication to close contours in Phase II. Note,
however, that if most objects span many subimages few closed contours will be
found in Phase I, regardless of subimage size (recall the circuit board
example).
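
The dependence of the border-to-interior ratio on N can be made concrete
with a short calculation. The Python sketch below assumes an R x R image
divided evenly among N PEs, with N a perfect square, so that each subimage
is R/√N pixels on a side, and counts the one-pixel-wide frame of border
pixels that must be fetched from neighboring PEs. The function name is
hypothetical.

```python
import math


def border_ratio(R, N):
    """Ratio of fetched border pixels to subimage pixels for one PE.

    Assumes N is a perfect square and R is divisible by sqrt(N).
    """
    side = R // math.isqrt(N)
    border = 4 * side + 4   # one-pixel-wide frame around a side x side block
    return border / (side * side)


for n in (4, 64, 1024):
    print(n, border_ratio(4096, n))
```

The ratio grows as N grows, which is the source of the falling speedup
factor described above.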

This  conflicting  demand  on  N  clearly  shows  the  value  of  considering  the 
algorithms  comprising  contour  extraction  as  a  whole,  rather  than  as  a  sequence 
of  individual,  independent  algorithms.  Performance  measures  that  quantify
various  demands  [SSS82]  can  aid  in  determining  the  value  of  N  to  use.  The 
value  of  N  should  be  chosen  given  knowledge  of  the  specific  performance 
characteristics  of  a  particular  parallel  processor  implementation. 

The activity during contour tracing is strongly dependent on the details of
the input image; each PE has a set of contours and partial contours that is
in all likelihood unique. Phase I of contour tracing requires only locally
available data, and execution time is data dependent. Phase II makes heavy
use of non-local data and also has data-dependent execution time. Linking
contours is feasible only as an asynchronous task in which each processor
proceeds with regard to its circumstances. Both phases of contour tracing are
suited to MIMD mode. The fraction of the total execution time for contour
extraction spent on contour tracing is typically small.

Comparison to Overlapped Approach. The advantages of the
overlapped approach presented in Section 9.3 versus the non-overlapped
approach are twofold. One, partial contour extension is not necessary. This
greatly simplifies the contour tracing process. Two, only sufficient primary
memory to hold one subimage for processing is needed.

The disadvantages of the overlapped approach are threefold. First, the
maximum size of an object of interest must be known a priori so that
subimage size can be established. Subimage size is constrained by the fact
that EGT performance tends to degrade with increasing subimage size relative
to image size. Thus, there will be a practical limit on the maximum object
size. Also, an object may happen to be completely within more than one
subimage. This must be recognized and handled appropriately. Second,
thresholding (including EGT) tends to perform less well when objects are
small relative to the image (in this case, subimage) size. The overlapped
method requires that objects cover no more than one fourth the area of a
subimage. Thus, there is a practical limit to the range of object sizes that
will be handled well for a given image; objects much smaller than the largest
objects are at a disadvantage. Finally, each pixel is processed for contour
extraction four times, once for each of the four overlapping subimages to
which it belongs.

The non-overlapped scenario could be implemented on a serial computer
system with virtual memory [Den70]. The disadvantage of this approach is
that when a contour spans more than one subimage, linking partial contours
requires that a representation of the subimages with partial contours to be
linked, as well as the associated contour table information, be accessible.
This may result in significant delay due to paging subimages into primary
memory. Paging overhead does not occur on a parallel system since the entire
image is stored in primary memory. Thus, it is the multiplicity of primary
memories (the large primary memory space) in a parallel system such as PASM
that makes the non-overlapping subimage approach practical.

The  disadvantages  of  the  non-overlapped  method  are  that  contour  tracing 
is  more  complex  and  sufficient  memory  to  process  an  entire  image  is  needed. 
The  advantages  of  the  non-overlapped  algorithms  are  fourfold.  There  is  no 
limit  to  maximum  object  size,  each  pixel  is  processed  just  once,  threshold 
choices  may  be  made  more  precisely  because  subimages  that  are  small  relative 
to  the  largest  objects  can  be  readily  used,  and  they  are  well-suited  for  parallel 
processing. 

9.7  Implications  for  Network  Design

When subimage border pixels are needed for the EGT algorithm, all PEs
simultaneously request the same border pixel relative to their subimages. For
example, when EGT begins (with the upper left corner subimage pixel) all PEs
will request (from the PE to their upper left) the pixel immediately above
and to the left of their upper left corner pixel. This data transfer can
occur for all PEs simultaneously. For contour tracing, the communication
needed is between 8-neighbor PEs and arbitrary others. These communication
links must be established to support partial contour linking. Hence, the
interconnection network should be able to perform permutations involving the
8-neighbors of a PE, as well as any one-to-one connection.


Because both SIMD and MIMD modes of processing are involved in the
contour extraction task, the interconnection network should be able to
function well in both modes. Networks that are controlled by routing tags, or
in some other distributed manner, can more easily operate in MIMD mode than
centrally controlled networks. This is particularly true when the patterns of
interconnection change rapidly, as is the case in Phase II contour tracing.

The PE-to-PE transfer of information must be efficient or the system will
be slowed. In its simplest form, this communication would be handled by the
PEs; each PE would control the network and perform all the network protocol
support. The routing tag control scheme for the ESC, for example, would be
easy for the PEs to implement.

Each word transferred and each new network setting can require processor
instructions. A more efficient method of PE-to-PE communication is by direct
memory access (DMA). DMA is a method for storing or retrieving data
without processor intervention. In its basic form, a PE would enter a DMA
handling routine upon request from another PE. This routine computes the
local memory address range of the data satisfying the request (e.g., partial
contour i-x-y coordinates), sends this information to the special DMA
hardware, and sets the network as needed for the transfer. The DMA
hardware then would autonomously retrieve the information from local memory
and perform the necessary network interfacing to send the data to the
requesting PE. DMA hardware is often designed to operate on a cycle-stealing
basis so that PE access to local memory is not severely affected. The PE is
still responsible, however, for checking the incoming data (after transfer is
complete) for transmission errors and so forth.

A more advanced implementation of DMA capability is the use of an
intelligent network interface unit (NIU). Requests for data from remote PEs
would be made to the local NIU, which would interpret and satisfy the request
by coordination with a remote NIU. The NIU would combine DMA capability
with network protocol support. VLSI technology may allow ready fabrication
of sophisticated NIUs.

Consider the transfer of subimage boundary points, and for the sake of
example, let R = 4096, N = 1024, and R/√N = 128. Rather than involve the
source and destination PEs of each transfer in that transaction for each pixel,
one of the DMA modes just described would be of great use. If pixels were
stored in PEs by row (rows numbered 0 to 127), and transfer from PE_i to
PE_{i+√N} was selected, the DMA hardware of PE_i would be instructed to
transfer 128 pixels starting at the address of row 127 of the subimage. The
DMA hardware associated with PE_{i+√N} would be set to read 128 pixels from
the network and store them beginning at an address representing row -1,
relative to the subimage of PE_{i+√N}. When data is transferred from PE_i to
PE_{i+1}, the situation is more complicated in that pixels to be transferred are
not stored contiguously in the memory of PE_i. Conventional DMA hardware
only supports physical block transfers of data. Here, a strong case for an
intelligent NIU can be made; an NIU could accept more complicated
instructions such as "transfer 128 pixels starting at address X, taking every
128th pixel."
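
The difference between the two transfers can be sketched in C as a strided
gather. This is only an illustration of the kind of instruction an intelligent
NIU might accept; the function name niu_gather and its parameters are
hypothetical, and one byte per pixel is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Copy `count` pixels from `src`, beginning at index `start` and
   advancing `stride` elements each time, into the contiguous buffer
   `dst`. A stride of 1 is an ordinary block transfer; a larger stride
   models "taking every stride-th pixel". (Hypothetical interface.) */
void niu_gather(const unsigned char *src, size_t start, size_t stride,
                size_t count, unsigned char *dst)
{
    size_t k;

    for (k = 0; k < count; k++)
        dst[k] = src[start + k * stride];
}
```

For a 128 x 128 subimage stored by row, sending the bottom row would
correspond to niu_gather(sub, 127 * 128, 1, 128, buf), which conventional
block DMA can already do, while sending the rightmost column would correspond
to niu_gather(sub, 127, 128, 128, buf), which needs the strided form.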

The interconnection network and any DMA or NIU hardware would be
heavily used in Phase II processing when PEs extending partial contours probe
remote PE memories that may contain the extensions of the partial contours.
As in the EGT algorithm, the NIU would be of great use because it could
process queries about possible extensions to partial contours without
interrupting the remote PE. There would be a combination of short and long
messages between PEs during this phase. A short message would occur when a
PE extending a partial contour requests information about possible extending
partial contours from a remote PE. If a connecting partial contour is found, a
long message consisting of the i-x-y sequence of the partial contour would be
sent. Thus, the interconnection network should support a variety of message
lengths so that the efficiency of sending either type of message is high.

9.8  Implications for System Design

The EGT and contour tracing algorithms were found to be clearly suited
for SIMD and MIMD mode processing, respectively. Probably the most basic
requirement for the operating system and the system hardware is to allow a
single job to switch dynamically between SIMD and MIMD modes of operation.
With only SIMD capability, vast inefficiency would occur in later stages of the
scenario. Having only MIMD mode is a less serious handicap, but will lengthen
execution time for the Sobel operator and determining local maxima and
minima, due to the need for explicit synchronization and data sharing, and the
inability to perform loop control in a central control unit while PE
computation occurs. Thus, the capability to switch dynamically between SIMD
and MIMD modes is important so that each subsequent portion of the scenario
can be executed in the most appropriate operational mode.

The PEs of an MIMD parallel computer must be able to operate
independently on their own data and instruction streams, and so are commonly
thought of as complete computer systems in their own right, with input/output
circuitry reflecting a working environment in which each PE is closely coupled
to others. The appropriate nature of the processor in an SIMD system PE is a
more  open  question.  At  one  extreme,  a  PE  could  consist  of  only  an 
arithmetic/logic  unit  (ALU),  a  microcode-level  interface  to  the  system  control 
unit,  basic  input/output  circuitry,  and  an  interface  for  processor  status 
information.  At  the  other,  the  SIMD  PE  would  have  the  functionality  of  the 
MIMD  PE,  but  would  lack  instruction  fetch  capability  and  would  receive 
machine  language  commands  from  its  control  unit  rather  than  from  its  local 
memory.  Of  course,  a  PE  can  combine  all  the  capability  of  a  pure  MIMD  PE 
with  the  ability  to  disable  the  program  counter  and  accept  external  machine 
language  instruction  streams.  Such  a  PE  design  would  be  suitable  for 
SIMD/MIMD  systems,  including  PASM.  It  might  be  thought  that  frequent 
operation  of  such  a  PE  in  SIMD  mode  might  underutilize  its  capabilities. 
Study  of  the  EGT  algorithm  shows  that  this  is  not  necessarily  true,  i.e., 
executing  EGT  in  SIMD  mode  enhances  overall  performance. 

Considering the EGT algorithm, there are two steps with data-dependent
times: computing a square root for the Sobel algorithm and the innermost for
loop (see Figure 9.4). Removing the variability and/or shortening the
execution time of these steps would improve the SIMD execution of the
algorithm. In the case of the square root operation, a fixed time solution either
in the form of a new, data-independent algorithm or special PE hardware
support is one answer. The algorithm could be executed by having a system
control unit issue the instructions, or the algorithm could be embedded in PE
microcode, giving each PE a "square root" instruction. Another answer would
be to use a modified Sobel operator formulated without the square root
operation; accuracy considerations might rule out this option.
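
Both answers can be sketched briefly in C. The first function below is one
well-known digit-by-digit integer square root, written so that it always runs
exactly 16 iterations for an operand of up to 32 bits; it is offered as an
example of a data-independent iteration count, not as the method the text has
in mind. The second function is the common absolute-value approximation to the
gradient magnitude, shown as one possible square-root-free formulation. The
names isqrt_fixed and sobel_mag_abs are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Integer square root (floor) of a value that fits in 32 bits.
   The loop always executes 16 times, whatever the operand, so every PE
   would finish in the same number of steps. */
unsigned int isqrt_fixed(unsigned long v)
{
    unsigned long root = 0;
    unsigned long bit = 1UL << 30;   /* largest power of 4 below 2^32 */
    int k;

    for (k = 0; k < 16; k++) {
        if (v >= root + bit) {
            v -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return (unsigned int)root;
}

/* Square-root-free magnitude estimate |gx| + |gy|. It overestimates the
   true magnitude sqrt(gx*gx + gy*gy) except when gx or gy is zero. */
int sobel_mag_abs(int gx, int gy)
{
    return abs(gx) + abs(gy);
}
```

The branch inside the loop still depends on the data, so an SIMD
implementation would presumably realize it with conditional assignment or PE
masking; only the iteration count is fixed in this sketch.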

The innermost for loop can probably benefit most from reducing, rather
than fixing, execution time. If each PE could maintain its own loop counters
instead of relying on a control unit, execution time could likely be reduced.
Incorporating local index registers in each PE would support this. Because PEs
will execute this loop different numbers of times, as well as for different ranges
of loop index values, there must be a mechanism to disable PEs that finish the
loop before others. One method to do this uses processor address masks
[Sie77a].
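
The effect of disabling PEs that finish early can be illustrated with a small
sequential simulation in C. This is a sketch only: the machine size NUM_PES,
the function name simd_masked_loop, and the loop body (summing the loop index)
are all hypothetical, and a simple per-PE enable flag stands in for the
address-mask mechanism of [Sie77a].

```c
#include <assert.h>

#define NUM_PES 4   /* small hypothetical machine, for illustration only */

/* Simulate an SIMD loop in which PE p iterates its own index from lo[p]
   to hi[p] inclusive. One broadcast step executes the loop body in every
   enabled PE; a PE is disabled once its local index passes hi[p].
   The per-PE sum of index values is left in sum[], and the number of
   broadcast steps is returned. */
int simd_masked_loop(const int lo[], const int hi[], long sum[])
{
    int index[NUM_PES];     /* local index register of each PE */
    int enabled[NUM_PES];   /* enable flag of each PE */
    int active = 0, steps = 0, p;

    for (p = 0; p < NUM_PES; p++) {
        index[p] = lo[p];
        sum[p] = 0;
        enabled[p] = (lo[p] <= hi[p]);
        active += enabled[p];
    }
    while (active > 0) {
        steps++;
        for (p = 0; p < NUM_PES; p++) {
            if (!enabled[p])
                continue;           /* masked-off PE ignores the step */
            sum[p] += index[p];
            index[p]++;
            if (index[p] > hi[p]) {
                enabled[p] = 0;
                active--;
            }
        }
    }
    return steps;
}
```

In this model the number of broadcast steps equals the longest per-PE
iteration count, which is why shortening the loop in the slowest PE matters
more than equalizing the others.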

Two alternatives were presented for determining when a PE can enter
Phase II. With the first, PEs are allowed to start Phase II processing after all
have completed Phase I. Hence, the operating system and hardware
communication paths must allow for synchronization of the PEs within an
MIMD job.

Since semaphores play a large part in ensuring correct linking of partial
contours in Phase II, processors should be able to perform a "test-and-set"
operation to facilitate correct semaphore implementation. Test-and-set is an
uninterruptible, or atomic, action in which the value of a data item in memory
is read, compared, and, upon meeting the test of the comparison, a new value
is stored in place of the original data item value. While methods to implement
such a facility are known for sequential processors, the parallel processing
environment, in which the item to be tested may be remote to the PE, presents
a greater challenge. An intelligent NIU at the destination PE could perform an
atomic "test-and-set" instruction for a requesting PE.
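
For readers who want a concrete picture, the fragment below sketches a binary
semaphore built on test-and-set using the C11 <stdatomic.h> facilities, which
postdate this work. It covers only the shared-memory case; the remote case
handled by an NIU is not modeled. The names contour_lock, contour_try_lock,
and contour_unlock are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* A binary semaphore guarding one partial contour (hypothetical type). */
typedef struct {
    atomic_flag held;
} contour_lock;

/* Atomically set the flag and report its previous state.
   Returns 1 if the caller acquired the lock, 0 if it was already held. */
int contour_try_lock(contour_lock *l)
{
    return !atomic_flag_test_and_set(&l->held);
}

/* Release the lock so that another PE may claim the contour. */
void contour_unlock(contour_lock *l)
{
    atomic_flag_clear(&l->held);
}
```

A lock would be initialized with contour_lock l = { ATOMIC_FLAG_INIT }; a PE
that receives 0 from contour_try_lock knows that another PE is already linking
that partial contour.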

Once all PEs have completed Phase II of the contour tracing algorithm,
the job is done. In order for the PASM System Control Unit (SCU) to know that
the job has been completed, each PE must be able to signal its Micro Controller
(MC) when it has finished, and MCs must be able to signal the SCU.

9.9  Conclusions

A  number  of  observations  were  made  and  conclusions  drawn  from  the 
contour  extraction  scenario.  In  particular,  non-overlapped  formulation  of  the 
algorithms  for  parallel  processing  was  shown  to  lead  to  several  advantages 
including  speedup,  elimination  of  object  size  constraints,  and  potential  for 
improved  accuracy.  Necessary  protocol  and  mechanisms  to  support  their 
correct  execution  were  detailed.  Overall,  the  parallel  algorithms  presented  are 
strong  contenders  to  replace  previous  methods  in  some  applications. 

One  such  application  is  quality  control  inspection  of  printed  circuit  boards. 
Large  object  handling  capability  is  needed  for  following  long  circuit  traces,  and 
sufficient  speedup  is  necessary  for  timely  response.  Other  applications  include 
those  involving  military  environments  in  which  real-time  processing  capability 
is  crucial. 

Considering  an  entire  scenario  in  the  light  of  parallelism  is  a  useful 
approach  for  matching  image  processing  tasks  and  parallel  architectures. 
Operating  system  features  found  to  be  important  and/or  useful  include  support 
for 

1.  dynamic  SIMD/MIMD  mode  switching, 

2.  synchronization within MIMD processes,

3.  semaphores,  and 

4.  access to the local memories of one PE by another PE.

Necessary and/or desirable hardware features include

1.  an interconnection network that supports permutation connections
between 8-neighbor PEs and direct connection between non-8-neighbor
PEs,

2.  an interconnection network equally effective in SIMD and MIMD
environments (implies decentralized network control) and with widely
varying message lengths,

3.  an  intelligent  network  interface  unit, 

4.  a  test-and-set  instruction, 

5.  PEs designed for constant-time instructions and fast loop index
manipulation,

6.  provision  for  PEs  to  signal  job  completion  to  a  system  controller,  and 

7.  provision  for  selective  PE  enabling/disabling. 

The  stated  operating  system  and  hardware  features  are  consistent  with  the 
capabilities  and  design  of  both  the  ESC  network  and  PASM. 

CHAPTER 10
CONCLUSIONS 

This research has addressed three main topics in the area of
parallel/distributed processing computer systems. Specifically, a fault-tolerant
multistage interconnection network was introduced, the performance of
alternative switching element designs suited for VLSI implementation was
analyzed, and a parallel scenario for digital image contour extraction was
studied. These three topics are unified by their joint contribution to the
engineering knowledge necessary for parallel/distributed computer system
design.

The demand for very high-speed computing motivates this entire area of
research. There are many disciplines with important problems that can neither
be feasibly solved given the performance level of available computers, nor be
solved without computers.

A sample of parallel processing computer research projects was presented
in Chapter 2. The PASM multimicroprocessor system was among those
described. Unique features of PASM include being (1) dynamically
reconfigurable, (2) able to operate in either SIMD or MIMD mode of
parallelism, and (3) able to be partitioned into machines of different sizes, each
of which can operate in SIMD or MIMD mode. Each of the systems surveyed
uses, or is designed to use, a multistage interconnection network. Together
they illustrate a range of architectural ideas and a range of intended
application domains. Thus, this material provides a context in which to view
the interconnection network and image processing studies of this research.

The Extra Stage Cube is a new fault-tolerant multistage interconnection
network intended for use in large-scale SIMD, MSIMD, MIMD, and
partitionable SIMD/MIMD computer systems, and is the network to be used in
the PASM prototype under construction at Purdue University. Chapter 3
defines the Generalized Cube network and shows how the ESC is derived from it.
The Generalized Cube was noted to be representative of a class of cube-type
networks, including the STARAN flip, the omega, the indirect binary n-cube,
and the SW-banyan (S=F=2, L=n) networks.

A formal development of the properties of the ESC was set forth in
Chapter 4. The basic concepts of network fault model and fault-tolerance
criterion were defined, and their meaning for the ESC was chosen to reflect the
anticipated operating environment of an interconnection network, rather than
for ease of analysis. The ESC was proved to be single-fault tolerant with
respect to its fault model and fault-tolerance criterion. A simple method was
given for finding a fault-free path through a faulted network. The ESC was
shown to tolerate some instances of multiple faults. Routing tags were
developed that allow full access to the fault-tolerance capabilities of the ESC.
Its partitioning and permuting capabilities were described.

Because the field of fault-tolerant multistage interconnection networks for
parallel/distributed computers is a new one, and because the ESC was
described in the early literature, the survey of such networks was presented in
Chapter 5, following the chapter describing the ESC. The challenging design
goals chosen for and met by the ESC result in its retaining its merit as new
networks have been proposed. The ESC was compared with the surveyed
fault-tolerant networks. The ESC is either less physically complex and equally
fault-tolerant or more fault-tolerant than other fault-tolerant multistage
interconnection networks that have been reported in the literature.

Chapter 6 presented a study of the terminal-pair reliability of the ESC.
Reliability was measured as the probability that there exists at least one path
between any network input and output. This extended the analytical work of
Chapter 4 establishing single-fault tolerance. An exact solution for the case of
two faults was developed based on the ESC fault model stated in Chapter 4.

From the reliability study of the ESC, opportunities for improved
fault-tolerance at minimal additional hardware or operational protocol overhead cost
were  recognized.  An  Enhanced  ESC  was  described  and  its  capabilities  studied 
in  Chapter  7.  Large  networks  were  found  to  benefit  more  than  small  ones 
from  this  modification. 

Alternative structures to traditional interchange boxes, motivated by VLSI
considerations, were presented in Chapter 8. The performance of a 4x4
crossbar node and a composite node formed from four 2x2 crossbar elements
was studied. Implementation of an ESC, or ESC-like, network using 4x4
switches was discussed.

Chapter 9 considered digital image contour extraction, an image
processing task within the application domain for PASM and with widespread
use. The value of the scenario approach was that it provided insight not
available from an individual consideration of the algorithms comprising the
contour extraction task. Numerous system design guidelines were uncovered
for the operating system, processing element design, and interconnection
network requirements. In addition, the parallel formulation of the task
provided the opportunity for faster processing speeds, elimination of object size
constraints, and improved accuracy.

In summary, the contributions of this work are threefold. The first is the
development and study of the Extra Stage Cube fault-tolerant, multistage
interconnection network. The second is the analysis of the performance of
alternative switching element structures for interconnection networks. The
third is the examination in detail of an image processing scenario for PASM
and the ESC network.

/*
** Program to directly enumerate fault pairs and lossy pairs for the ESC
** with STAGE bypassing.
**
** Used to confirm the equations for counting these objects.
**
** Run giving the value of n as the argument.
**
** WARNING >> For n = 6 run time is 125.0 CPU HOURS <<
**            For n = 5 run time is slightly over 4.5 hours
**            For n = 4 run time is slightly over 11 minutes
**            For n = 3 run time is 18 seconds
**            For n = 2 run time is less than 1 second
**
** Options:   -v verbose   gives information on each fault pair (disabled);
**                         used for debugging
**
** Variables: n         number of network stages
**                      (not counting extra stage)
**            N         number of network inputs
**            bbfault   number of BB-fault pairs
**            bblossy   number of BB-lossy pairs
**            lbfault   number of LB-fault pairs
**            lblossy   number of LB-lossy pairs
**            llfault   number of LL-fault pairs
**            lllossy   number of LL-lossy pairs
**            mask      constant array containing bit masks used to
**                      compute the stage outputs of the primary and
**                      secondary paths given the s/d pair and stage
*/

#include <stdio.h>

int bads, loss, verbose;

typedef unsigned long uint;
uint n, N;

struct {
    uint stage;
    uint link;
} bad[4];

main(argc, argv)
int argc;
char *argv[];
{
    register uint a, b;
    register int i, j;
    int bblossy, lblossy, lllossy, bbfault, lbfault, llfault;
    int oldloss;

    verbose = 0;
    if ( argc==3 ) {
        if (strcmp(argv[1],"-v")==0)
            verbose = 1;
        else {
            fprintf(stderr,"usage: %s [-v] n\n",argv[0]);
            exit(1);
        }
        argc--;
        argv++;
    }
    if ( argc!=2 ) {
        fprintf(stderr,"usage: %s n\n",argv[0]);
        exit(1);
    }
    n = atoi(argv[1]);
    N = 1<<n;
    bblossy = lblossy = lllossy = bbfault = lbfault = llfault = 0;

    /* BB-fault pair section */
    loss = 0;
    for ( i=n ; i>=0 ; i-- ) {
        /* for all stages */
        for ( a=0 ; a<N/2 ; a++ ) {
            /* for the "upper" output of each box in stage i */
            for ( j=i ; j>=0 ; j-- ) {
                /* for all stages less than or equal to i */
                for ( b=0 ; b<N/2 ; b++ ) {
                    /* for the "upper" outputs of each box in stage j */
                    if ( i==j && b<=a )
                        continue;
                    /* unless i=j, then for the "upper" output of each stage j box with */
                    /* outputs greater than a */
                    bbfault = bbfault + 1;
                    if ( i==j && (i==n || i==0) )
                        continue;
                    /* with both faults confined to stage n or stage 0 the ESC continues to */
                    /* function; there is no need to test such fault pairs for loss of BFIC */
                    if ( (i==n && j!=n) || (i!=0 && j==0) ) {
                        bblossy++;
                        continue;
                    }
                    /* a stage n or stage 0 box fault combined with a box fault outside of that */
                    /* stage is lossy; count it as such and go on to the next fault pair */
                    oldloss = loss;
                    bad[0].stage = i;
                    bad[0].link = zeroins(i,a,n);
                    bad[1].stage = i;
                    bad[1].link = lower(i,zeroins(i,a,n),n);
                    bad[2].stage = j;
                    bad[2].link = zeroins(j,b,n);
                    bad[3].stage = j;
                    bad[3].link = lower(j,zeroins(j,b,n),n);
                    bads = 4;
                    /*
                    if (verbose)
                        printf("(%2d,%2d) & (%2d,%2d)\n & (%2d,%2d) & (%2d,%2d)\n",i,bad[0].link,i,bad[1].link,j,bad[2].link,j,bad[3].link);
                    */
                    test();
                    if (loss>oldloss)
                        bblossy++;
                }
            }
        }
    }
    printf("Number BB-lossy pairs = %d\n", bblossy);
    printf("Number BB-fault pairs = %d\n", bbfault);

    /* LB-fault pair section */
    loss = 0;
    for ( i=n ; i>0 ; i-- ) {
        /* for all stages greater than 0 (there are no LB-fault pairs with */
        /* the "uppermost fault" in stage 0) */
        for ( a=0 ; a<N/2 ; a++ ) {
            /* assume the box fault is in stage i; for the "upper" output of each box */
            for ( j=i ; j>0 ; j-- ) {
                /* for all stages less than or equal to i and */
                /* greater than 0 (there are no stage 0 links) */
                for ( b=0 ; b<N ; b++ ) {
                    /* for all outputs in stage j (which must contain the link fault) */
                    lbfault++;
                    if ( i==n ) {
                        lblossy++;
                        continue;
                    }
                    /* an LB-fault pair containing a stage n box fault is a lossy pair */
                    /* no need to test */
                    oldloss = loss;
                    bad[0].stage = i;
                    bad[0].link = zeroins(i,a,n);
                    bad[1].stage = i;
                    bad[1].link = lower(i,zeroins(i,a,n),n);
                    bad[2].stage = j;
                    bad[2].link = b;
                    bads = 3;
                    if (verbose)
                        printf("(%2d,%2d) & (%2d,%2d)\n & (%2d,%2d)\n",bad[0].stage,bad[0].link,bad[1].stage,bad[1].link,bad[2].stage,bad[2].link);
                    test();
                    if (loss>oldloss)
                        lblossy++;
                }
            }
        }
        for ( a=0 ; a<N ; a++ ) {
            /* assume now that the link fault is in stage i; for each stage i output */
            for ( j=i-1 ; j>=0 ; j-- ) {
                /* for all stages less than i */
                for ( b=0 ; b<N/2 ; b++ ) {
                    /* for the "upper" outputs of each box in stage j */
                    lbfault = lbfault + 1;
                    if ( j==0 ) {
                        lblossy++;
                        continue;
                    }
                    /* an LB-fault pair containing a stage 0 box fault is a lossy pair */
                    /* no need to test */
                    oldloss = loss;
                    bad[0].stage = i;
                    bad[0].link = a;
                    bad[1].stage = j;
                    bad[1].link = zeroins(j,b,n);
                    bad[2].stage = j;
                    bad[2].link = lower(j,zeroins(j,b,n),n);
                    bads = 3;
                    if (verbose)
                        printf("(%2d,%2d) & (%2d,%2d)\n & (%2d,%2d)\n", bad[0].stage,bad[0].link,bad[1].stage,bad[1].link,bad[2].stage,bad[2].link);
                    test();
                    if (loss>oldloss)
                        lblossy++;
                }
            }
        }
    }
    printf("Number LB-lossy pairs = %d\n", lblossy);
    printf("Number LB-fault pairs = %d\n", lbfault);


    /* LL-fault pair section */
    loss = 0;
    for ( i=n ; i>0 ; i-- ) {
        /* for all stages with links (i.e., excluding stage 0) */
        for ( a=0 ; a<N ; a++ ) {
            /* for all outputs of stage i */
            for ( j=i ; j>0 ; j-- ) {
                /* for all stages less than or equal to i that have links */
                for ( b=0 ; b<N ; b++ ) {
                    /* for all outputs of stage j */
                    if ( b<=a && i==j )
                        continue;
                    /* unless i>j, then for all stage j outputs greater than a */
                    llfault = llfault + 1;
                    /* count number of LL-faulty pairs */
                    oldloss = loss;
                    bad[0].stage = i;
                    bad[0].link = a;
                    bad[1].stage = j;
                    bad[1].link = b;
                    bads = 2;
                    /* if (verbose)
                        printf("(%2d,%2d) & (%2d,%2d)\n\n",i,a,j,b); */
                    test();
                    if (loss > oldloss)
                        lllossy++;
                    /* count number of LL-lossy pairs */
                }
            }
        }
    }
    printf("Number LL-lossy pairs = %d\n", lllossy);
    printf("Number LL-fault pairs = %d\n", llfault);
}

uint mask[] = {0,0x1,0x3,0x7,0xf,0x1f,0x3f,0x7f,0xff,0x1ff,0x3ff,0x7ff,0xfff,
               0x1fff,0x3fff,0x7fff,0xffff,0x1ffff,0x3ffff,0x7ffff,0xfffff,
               0x1fffff,0x3fffff,0x7fffff,0xffffff,0x1ffffff,0x3ffffff,0x7ffffff,
               0xfffffff,0x1fffffff,0x3fffffff,0x7fffffff};

test()
/*
** Test paths through the network until a source/destination pair that
** cannot communicate is found, or all pairs have been tested.
** Assumes that the bad links have already been listed in bad[].
**
** Variables: s        source
**            d        destination
**            stage    the number of the stage to be used in computing stage
**                     outputs for the s/d pair being tested
**            link     output of stage "stage" taken by a path of the s/d pair
**                     being tested
**            fault1   flag indicating the s/d pair has a faulty primary path
**            fault2   flag indicating the s/d pair has a faulty secondary path
**
** Affects global variable loss.
*/
{
    register int stage, fault1, fault2;
    register uint s, d, link;

    for ( s=0 ; s<N ; s++ ) {
        for ( d=0 ; d<N ; d++ ) {
            fault1 = fault2 = 0;
            for ( stage=n ; stage>=0 ; stage-- ) {
                link = (s&mask[stage]) | (d&(~mask[stage]));
                /* compute stage output of primary path */
                if ( faulty(stage,link) ) {
                    fault1++;
                    break;
                }
            }
            for ( stage=n ; stage>0 ; stage-- ) {
                link = ((s&mask[stage]) | (d&(~mask[stage]))) ^ 1;
                /* compute stage output of secondary path */
                if ( faulty(stage,link) ) {
                    fault2++;
                    break;
                }
            }
            if ( verbose ) {
                printf("%2d to %2d ", s, d);
                printf(" primary: ");
                if(fault1>0) printf("faulty at stage %d,",stage);
                else printf("okay,");
                printf(" secondary: ");
                if(fault2>0) printf("faulty at stage %d,",stage);
                else printf("okay,");
                printf(" %s\n", (fault1>0 && fault2>0)?"LOSS" :"NOLOSS");
            }
            if ( fault1>0 && fault2>0 )
                loss++;
        }
    }
}

faulty(stage,link)
/*
** Check a stage output against the list of fault labels.
** If there is a match, return a 1; else return 0.
*/
uint stage,link;
{
    int i;

    for ( i=0 ; i<bads ; i++ )
        if ( stage==bad[i].stage && link==bad[i].link )
            return 1;
    return 0;
}

lower(stage,link,n)
/*
** Generate the corresponding "lower" box output label
** given the stage number, label of the "upper" output,
** and the number of stages
*/
uint stage, link, n;
{
    if (stage == n)
        return(link + 1);
    else
        return(link + (1 << stage));
}

zeroins(stage,link,n)
/*
** Generate the "upper" box output label given the number of the
** stage containing the box, the "number" of the box within that
** stage (counting from zero with the "upper" box in the stage),
** and the number of stages
*/
uint stage, link, n;
{
    uint upper, lower;

    if (stage == n)
        stage = 0;
    upper = (link & (~0 << stage)) << 1;
    lower = link & ~(~0 << stage);
    return( upper | lower );
}


/*
** Program to directly enumerate fault pairs and lossy pairs for the ESC
** with box bypassing.
**
** Used to confirm the equations for counting these objects.
**
** Run giving the value of n as the argument.
**
** WARNING >> For n = 6 run time is 8.8 CPU DAYS <<
**            For n = 5 run time is roughly 8 hours
**            For n = 4 run time is slightly under 15 minutes
**            For n = 3 run time is 39 seconds
**            For n = 2 run time is less than 1 second
**
** Options:   -v verbose   gives information on each fault pair (disabled);
**                         used for debugging
**
** Variables: n         number of network stages
**                      (not counting extra stage)
**            N         number of network inputs
**            bbfault   number of BB-fault pairs
**            bblossy   number of BB-lossy pairs
**            lbfault   number of LB-fault pairs
**            lblossy   number of LB-lossy pairs
**            llfault   number of LL-fault pairs
**            lllossy   number of LL-lossy pairs
**            mask      constant array containing bit masks used to
**                      compute the stage outputs of the primary and
**                      secondary paths given the s/d pair and stage
*/


tidt  <(tdlo.h> 

i  loi*.  T*rboi«; 
ccilgaad  loag  alat; 


.  ,  < 

ulat  ataga: 
niat  llak; 

.:4l; 

.r*c,  argr) 

•irg';. 


r«gl*t*r  alat  a,b; 
r*gl(t*r  lot  l.J: 

lat  bbloaay.  Ibloaay,  llloaay,  bbfaalt,  Ibfaalt,  llfaalt; 
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iat  oldloaa; 

T«rbo(*  =  0; 
if  (  argc==S  )  < 

if  (•tPCBp(»rgT[t],*-T*)~0) 

Ttrboft  =  1; 

•If*  < 

fpriatf (ftdcrr.'offg*;  !•  (-•]  a\B*.frgf [OD : 

•xit(l); 

> 

arge— ; 

•rgr*-*; 

> 

if  (  argcl=3  )  < 

fprlatf (itdtrr.'sffg*:  f«  a\B*.fTgT[0]): 

•zit(l); 

> 

B  =  BtoKargrtl]}: 

I  =  1«b; 

bbloffj  =  Iblofiy  =  lllotiy  =  bbfaolt  =  Ibfaslt  =  llfaslt  -  0; 

    /* BB-fault pair section */
    loss = 0;
    for ( i=n ; i>=0 ; i-- ) {
        /* for all stages */
        for ( a=0 ; a<N/2 ; a++ ) {
            /* for the 'upper' output of each box in stage i */
            for ( j=i ; j>=0 ; j-- ) {
                /* for all stages less than or equal to i */
                for ( b=0 ; b<N/2 ; b++ ) {
                    /* for the 'upper' outputs of each box in stage j */
                    if ( i==j && b<=a )
                        continue;
                    /* unless i=j, then for the 'upper' output of each stage j box with */
                    /* outputs greater than a */
                    bbfault++;
                    if ( i==j && (i==n || i==0) )
                        continue;
                    /* with both faults confined to stage n or stage 0 the ESC continues to */
                    /* function; there is no need to test such fault pairs for loss of PTIC */
                    oldloss = loss;
                    if ( i!=j && i==n ) {
                        /* if there is one stage n fault then handle as a special case */
                        bad[0].stage = j;
                        bad[0].link = zeroins(j,b,n);
                        bad[1].stage = j;
                        bad[1].link = lower(j,zeroins(j,b,n),n);
                        /* bad[0] and bad[1] hold the fault labels for the non-stage n box, which */
                        /* can be checked to see if they block any s/d path; fault labels for the */
                        /* stage n box are omitted in this case */
                        bad[2].link = zeroins(i,a,n);
                        bad[3].link = lower(i,zeroins(i,a,n),n);
                        /* bad[2].link and bad[3].link are used here to hold the addresses of */
                        /* the two sources that must have only their primary paths checked */
                        badn = 2;
                        /* there are only two fault labels dealing with other than a stage n box fault */
                        test(4);
                    }
                    else {
                        /* case of no box faults in stage n */
                        bad[0].stage = i;
                        bad[0].link = zeroins(i,a,n);
                        bad[1].stage = i;
                        bad[1].link = lower(i,zeroins(i,a,n),n);
                        bad[2].stage = j;
                        bad[2].link = zeroins(j,b,n);
                        bad[3].stage = j;
                        bad[3].link = lower(j,zeroins(j,b,n),n);
                        badn = 4;
                        /*
                        if (verbose)
                            printf("(%2d,%2d) & (%2d,%2d) & (%2d,%2d) & (%2d,%2d)\n",
                                i,bad[0].link,i,bad[1].link,j,bad[2].link,j,bad[3].link);
                        */
                        test(2);
                    }
                    if (loss>oldloss)
                        bblossy++;
                }
            }
        }
    }
    printf("Number BB-lossy pairs = %d\n", bblossy);
    printf("Number BB-fault pairs = %d\n", bbfault);


    /* LB-fault pair section */
    loss = 0;
    for ( i=n ; i>0 ; i-- ) {
        /* for all stages greater than 0 (there are no LB-fault pairs with */
        /* the 'uppermost fault' in stage 0) */
        for ( a=0 ; a<N/2 ; a++ ) {
            /* assume the box fault is in stage i; for the 'upper' output of each box */
            for ( j=i ; j>0 ; j-- ) {
                /* for all stages less than or equal to i and */
                /* greater than 0 (there are no stage 0 links) */
                for ( b=0 ; b<N ; b++ ) {
                    /* for all outputs in stage j (which must contain the link fault) */
                    lbfault++;
                    if ( i==n && j==n && b!=zeroins(i,a,n) && b!=lower(i,zeroins(i,a,n),n) )
                        continue;
                    /* an LB-fault pair containing a stage n box fault and a stage n link fault, */
                    /* where the link fault is not in one of the two links attached to the */
                    /* faulty box, is not an LB-lossy pair, and need not be tested */
                    oldloss = loss;
                    if ( i==n ) {
                        /* if there is a stage n box fault then handle as a special case */
                        bad[0].stage = j;
                        bad[0].link = b;
                        bad[2].link = zeroins(i,a,n);
                        bad[3].link = lower(i,zeroins(i,a,n),n);
                        /* bad[2].link and bad[3].link are used here to hold the addresses of */
                        /* the two sources that must have only their primary paths checked */
                        badn = 1;
                        /* there is only one fault label dealing with other than a stage n box fault */
                        test(3);
                    }
                    else {
                        /* case of no stage n box fault */
                        bad[0].stage = i;
                        bad[0].link = zeroins(i,a,n);
                        bad[1].stage = i;
                        bad[1].link = lower(i,zeroins(i,a,n),n);
                        bad[2].stage = j;
                        bad[2].link = b;
                        badn = 3;
                        if (verbose)
                            printf("(%2d,%2d) & (%2d,%2d) & (%2d,%2d)\n",
                                bad[0].stage,bad[0].link,bad[1].stage,bad[1].link,bad[2].stage,bad[2].link);
                        test(2);
                    }
                    if (loss>oldloss)
                        lblossy++;
                }
            }
        }
        for ( a=0 ; a<N ; a++ ) {
            /* assume now that the link fault is in stage i; for each stage i output */
            /* there can be no stage n box fault requiring special case treatment in */
            /* this instance */
            for ( j=i-1 ; j>=0 ; j-- ) {
                /* for all stages less than i */
                for ( b=0 ; b<N/2 ; b++ ) {
                    /* for the 'upper' outputs of each box in stage j */
                    lbfault++;
                    oldloss = loss;
                    bad[0].stage = i;
                    bad[0].link = a;
                    bad[1].stage = j;
                    bad[1].link = zeroins(j,b,n);
                    bad[2].stage = j;
                    bad[2].link = lower(j,zeroins(j,b,n),n);
                    badn = 3;
                    if (verbose)
                        printf("(%2d,%2d) & (%2d,%2d) & (%2d,%2d)\n",
                            bad[0].stage,bad[0].link,bad[1].stage,bad[1].link,bad[2].stage,bad[2].link);
                    test(2);
                    if (loss>oldloss)
                        lblossy++;
                }
            }
        }
    }
    printf("Number LB-lossy pairs = %d\n", lblossy);
    printf("Number LB-fault pairs = %d\n", lbfault);


    /* LL-fault pair section */
    loss = 0;
    for ( i=n ; i>0 ; i-- ) {
        /* for all stages with links (i.e., excluding stage 0) */
        for ( a=0 ; a<N ; a++ ) {
            /* for all outputs of stage i */
            for ( j=i ; j>0 ; j-- ) {
                /* for all stages less than or equal to i that have links */
                for ( b=0 ; b<N ; b++ ) {
                    /* for all outputs of stage j */
                    if ( b<=a && i==j )
                        continue;
                    /* unless i=j, then for all stage j outputs greater than a */
                    llfault++;
                    /* count number of LL-fault pairs */
                    oldloss = loss;
                    bad[0].stage = i;
                    bad[0].link = a;
                    bad[1].stage = j;
                    bad[1].link = b;
                    badn = 2;
                    /* if (verbose)
                           printf("(%2d,%2d) & (%2d,%2d)\n",i,a,j,b); */
                    test(2);
                    if (loss > oldloss)
                        lllossy++;
                    /* count number of LL-lossy pairs */
                }
            }
        }
    }
    printf("Number LL-lossy pairs = %d\n", lllossy);
    printf("Number LL-fault pairs = %d\n", llfault);
}
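
The three "fault pairs" totals printed by main can be cross-checked without running the path tests. The sketch below is not part of the original listing: the function names (count_bb, count_lb, count_ll, choose2, check_counts) are hypothetical, and it assumes the loop bounds read as n+1 stages of N/2 boxes and n stages (n down to 1) of N links. Under that assumption the enumerations should equal C(B,2), B*L and C(L,2), where B = (n+1)N/2 and L = nN.

```c
#include <assert.h>

/* Number of unordered pairs drawn from m items. */
long choose2(long m) { return m * (m - 1) / 2; }

/* Re-enumerate the BB-fault index ranges: pairs of distinct boxes. */
long count_bb(int n)
{
    long N = 1L << n, count = 0;
    int i, j;
    long a, b;
    for (i = n; i >= 0; i--)
        for (a = 0; a < N / 2; a++)
            for (j = i; j >= 0; j--)
                for (b = 0; b < N / 2; b++) {
                    if (i == j && b <= a)
                        continue;       /* same stage: count each pair once */
                    count++;
                }
    return count;
}

/* Re-enumerate the LB-fault index ranges: one box fault, one link fault. */
long count_lb(int n)
{
    long N = 1L << n, count = 0;
    int i, j;
    long a, b;
    for (i = n; i > 0; i--) {
        for (a = 0; a < N / 2; a++)     /* box in stage i, link in stage j <= i */
            for (j = i; j > 0; j--)
                for (b = 0; b < N; b++)
                    count++;
        for (a = 0; a < N; a++)         /* link in stage i, box in stage j < i */
            for (j = i - 1; j >= 0; j--)
                for (b = 0; b < N / 2; b++)
                    count++;
    }
    return count;
}

/* Re-enumerate the LL-fault index ranges: pairs of distinct links. */
long count_ll(int n)
{
    long N = 1L << n, count = 0;
    int i, j;
    long a, b;
    for (i = n; i > 0; i--)
        for (a = 0; a < N; a++)
            for (j = i; j > 0; j--)
                for (b = 0; b < N; b++) {
                    if (i == j && b <= a)
                        continue;
                    count++;
                }
    return count;
}

/* Compare the enumerations with the assumed closed forms for one n. */
void check_counts(int n)
{
    long N = 1L << n;
    long boxes = (long)(n + 1) * N / 2;  /* B */
    long links = (long)n * N;            /* L */
    assert(count_bb(n) == choose2(boxes));
    assert(count_lb(n) == boxes * links);
    assert(count_ll(n) == choose2(links));
}
```

Calling check_counts(3) exercises the N = 8 case, where the totals work out to 120 BB, 384 LB and 276 LL pairs.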


uint mask[] = {0,0x1,0x3,0x7,0xf,0x1f,0x3f,0x7f,0xff,0x1ff,0x3ff,0x7ff,0xfff,
    0x1fff,0x3fff,0x7fff,0xffff,0x1ffff,0x3ffff,0x7ffff,0xfffff,
    0x1fffff,0x3fffff,0x7fffff,0xffffff,0x1ffffff,0x3ffffff,0x7ffffff,
    0xfffffff,0x1fffffff,0x3fffffff,0x7fffffff};


test(labelcode)
/*
** Test paths through the network until a source/destination pair that
** cannot communicate is found, or all pairs have been tested.
** Assumes that the bad links have already been listed in bad[*].
** Affects global variable loss.
**
** Variables:
**
**   s          source
**   d          destination
**   stage      the number of the stage to be used in computing stage
**              outputs for the s/d pair being tested
**   link       output of stage 'stage' taken by a path of the s/d pair
**              being tested
**   fault1     flag indicating the s/d pair has a faulty primary path
**   fault2     flag indicating the s/d pair has a faulty secondary path
**   labelcode  code used by 'test' to properly handle the case
**              of one stage n box fault. Specifically, the
**              links from the faulty box must not be treated as
**              if they were faulty and the two sources using
**              the faulty box must have only their primary paths
**              checked for blockage, as they have no secondary
**              path available
**                2 = no special handling needed
**                3 = LB-fault with box fault in stage n
**                4 = BB-fault with one box fault in stage n
*/
int labelcode;
{
    register int stage, fault1, fault2;
    register uint s, d, link;

    for ( s=0 ; s<N ; s++ ) {
        for ( d=0 ; d<N ; d++ ) {
            fault1 = fault2 = 0;
            for ( stage=n ; stage>=0 ; stage-- ) {
                link = (s&mask[stage])|(d&(~mask[stage]));
                /* compute stage output of primary path */
                if ( faulty(stage,link) ) {
                    fault1++;
                    break;
                }
            }
            if ( labelcode==2 || (labelcode!=2 && s!=bad[2].link && s!=bad[3].link) ) {
                /* test secondary path only when appropriate, i.e., when the source is */
                /* not connected to a faulty stage n box */
                for ( stage=n ; stage>0 ; stage-- ) {
                    link = ((s&mask[stage])|(d&(~mask[stage]))) ^ 1;
                    /* compute stage output of secondary path */
                    if ( faulty(stage,link) ) {
                        fault2++;
                        break;
                    }
                }
            }
            else
                fault2++;
            /* if the secondary path should not be tested, set fault2 as if the */
            /* secondary were faulty so that if the primary is found to be faulty, */
            /* the variable 'loss' will be incremented, correctly indicating an */
            /* s/d pair that cannot communicate */
            if ( verbose ) {
                printf("%2d to %2d ", s, d);
                printf(" primary: ");
                if(fault1>0) printf("faulty at stage %d,",stage);
                else printf("okay,");
                printf(" secondary: ");
                if(fault2>0) printf("faulty at stage %d,",stage);
                else printf("okay,");
                printf(" %s\n", (fault1>0 && fault2>0)?"LOSS":"NO LOSS");
            }
            if ( fault1>0 && fault2>0 )
                loss++;
        }
    }
}
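
The path computation inside test can be read as follows: at each stage the link label keeps the low-order `stage` bits of the source and takes the remaining bits from the destination, and the secondary path uses the same label with bit 0 complemented. The fragment below is a minimal sketch of that reading, not the author's code. The names esc_mask, primary_link and secondary_link are hypothetical, and esc_mask(k) stands in for the listing's mask[k] table.

```c
#include <assert.h>

typedef unsigned int uint;

/* Low-order k bits set; assumed equivalent to mask[k] in the listing. */
uint esc_mask(uint k)
{
    assert(k < 32);                 /* the table in the listing stops at 31 bits */
    return (1u << k) - 1u;
}

/* Label of the stage output used by the primary path from s to d. */
uint primary_link(uint s, uint d, uint stage)
{
    return (s & esc_mask(stage)) | (d & ~esc_mask(stage));
}

/* Label of the stage output used by the secondary path (stages n down to 1). */
uint secondary_link(uint s, uint d, uint stage)
{
    return primary_link(s, d, stage) ^ 1u;
}
```

For n = 3, source 5 and destination 2, this reading gives primary labels 5, 1, 3, 2 at stages 3, 2, 1, 0 and secondary labels 4, 0, 2 at stages 3, 2, 1.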


faulty(stage,link)
/*
** Check a stage output against the list of fault labels.
** If there is a match, return a 1, else return 0.
*/
uint stage, link;
{
    int i;

    for ( i=0 ; i<badn ; i++ )
        if ( stage==bad[i].stage && link==bad[i].link )
            return 1;
    return 0;
}


lower(stage,link,n)
/*
** Generate the corresponding 'lower' box output label
** given the stage number, label of the 'upper' output,
** and the number of stages
*/
uint stage, link, n;
{
    if (stage == n)
        return(link ^ 1);
    return(link ^ (1 << stage));
}


zeroins(stage,link,n)
/*
** Generate the 'upper' box output label given the number of the
** stage containing the box, the 'number' of the box within that
** stage (counting from zero with the 'upper' box in the stage),
** and the number of stages
*/
uint stage, link, n;
{
    uint upper, lower;

    if (stage == n)
        stage = 0;
    upper = (link & (~0 << stage)) << 1;
    lower = link & ~(~0 << stage);
    return( upper | lower );
}
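
As a worked illustration of the two labeling helpers: the 'upper' output label of a box is its box number with a 0 bit inserted at the bit position the stage controls (bit 0 for stage n), and the 'lower' label is the same value with that bit set. The sketch below restates this in ANSI C under hypothetical names (zeroins_sketch, lower_sketch) so that it does not clash with the listing. It is a reading of the routines above rather than a replacement for them.

```c
#include <assert.h>

typedef unsigned int uint;

/* Insert a 0 bit into `box` at the position controlled by `stage`. */
uint zeroins_sketch(uint stage, uint box, uint n)
{
    uint upper, low;

    assert(stage <= n);
    if (stage == n)
        stage = 0;                           /* extra stage pairs on bit 0 */
    upper = (box & (~0u << stage)) << 1;     /* bits at or above the slot move up */
    low = box & ~(~0u << stage);             /* bits below the slot stay put */
    return upper | low;
}

/* Set the inserted bit to obtain the 'lower' output label of the same box. */
uint lower_sketch(uint stage, uint link, uint n)
{
    uint bit;

    assert(stage <= n);
    bit = 1u << ((stage == n) ? 0u : stage);
    assert((link & bit) == 0u);              /* expects an 'upper' label */
    return link | bit;
}
```

With n = 3, box 2 of stage 1 has outputs 4 and 6, box 3 of stage 2 has outputs 3 and 7, and box 2 of stage 3 (the extra stage) has outputs 4 and 5.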